Preface to the Third Edition


For some forty years the first and second editions of this book have been 
used by students to acquire a basic knowledge of the theory and methods of 
multivariate statistical analysis. The book has also served a wider community 
of statisticians in furthering their understanding and proficiency in this field. 
Since the second edition was published, multivariate analysis has been 
developed and extended in many directions. Rather than attempting to cover, 
or even survey, the enlarged scope, I have elected to elucidate several aspects 
that are particularly interesting and useful for methodology and comprehen- 
sion. 

Earlier editions included some methods that could be carried out on an
adding machine! In the twenty-first century, however, computational tech- 
niques have become so highly developed and improvements come so rapidly 
that it is impossible to include all of the relevant methods in a volume on the 
general mathematical theory. Some aspects of statistics exploit computational 
power such as the resampling technologies; these are not covered here. 

The definition of multivariate statistics implies the treatment of variables 
that are interrelated. Several chapters are devoted to measures of correlation 
and tests of independence. A new chapter, “Patterns of Dependence; Graphical
Models,” has been added. A so-called graphical model is a set of vertices
or nodes identifying observed variables together with a new set of edges 
suggesting dependences between variables. The algebra of such graphs is an 
outgrowth and development of path analysis and the study of causal chains. 
A graph may represent a sequence in time or logic and may suggest causation 
of one set of variables by another set. 

Another new topic systematically presented in the third edition is that of 
elliptically contoured distributions. The multivariate normal distribution, 
which is characterized by the mean vector and covariance matrix, has a 
limitation that the fourth-order moments of the variables are determined by 
the first- and second-order moments. The class of elliptically contoured
distributions relaxes this restriction. A density in this class has contours of
equal density which are ellipsoids as does a normal density, but the set of
fourth-order moments has one further degree of freedom. This topic is
expounded by the addition of sections to appropriate chapters. 

Reduced rank regression developed in Chapters 12 and 13 provides a 
method of reducing the number of regression coefficients to be estimated in 
the regression of one set of variables on another. This approach includes the
limited-information maximum-likelihood estimator of an equation in a simul- 
taneous equations model. 

The preparation of the third edition has benefited from the advice and
comments of readers of the first and second editions as well as from reviewers
of the current revision. In addition to readers of the earlier editions listed in
those prefaces, I want to thank Michael Perlman and Kathy Richards for their
assistance in getting this manuscript ready.


T. W. ANDERSON 


Stanford, California 
February 2003 


Preface to the Second Edition 


Twenty-six years have passed since the first edition of this book was pub- 
lished. During that time great advances have been made in multivariate 
statistical analysis, particularly in the areas treated in that volume. This new
edition purports to bring the original edition up to date by substantial
revision, rewriting, and additions. The basic approach has been maintained,
namely, a mathematically rigorous development of statistical methods for
observations consisting of several measurements or characteristics of each
subject and a study of their properties. The general outline of topics has been 
retained. 

The method of maximum likelihood has been augmented by other considerations.
In point estimation of the mean vector and covariance matrix,
alternatives to the maximum likelihood estimators that are better with
respect to certain loss functions, such as Stein and Bayes estimators, have
been introduced. In testing hypotheses, likelihood ratio tests have been
supplemented by other invariant procedures. New results on distributions
and asymptotic distributions are given; some significant points are tabulated. 
Properties of these procedures, such as power functions, admissibility, unbi- 
asedness, and monotonicity of power functions, are studied. Simultaneous 
confidence intervals for means and covariances are developed. A chapter on 
factor analysis replaces the chapter sketching miscellaneous results in the 
first edition. Some new topics, including simultaneous equations models and 
linear functional relationships, are introduced. Additional problems present 
further results. 

It is impossible to cover all relevant material in this book; what seems 
most important has been included. For a comprehensive listing of papers 
until 1966 and books until 1970 the reader is referred to A Bibliography of
Multivariate Statistical Analysis by Anderson, Das Gupta, and Styan (1972).
Further references can be found in Multivariate Analysis: A Selected and
Abstracted Bibliography, 1957-1972 by Subrahmaniam and Subrahmaniam
(1973).

I am in debt to many students, colleagues, and friends for their suggestions
and assistance; they include Yasuo Amemiya, James Berger, Byoung-Seon 
Choi, Arthur Cohen, Margery Cruise, Somesh Das Gupta, Kai-Tai Fang, 
Gene Golub, Aaron Han, Takeshi Hayakawa, Jogi Henna, Huang Hsu, Fred 
Huffer, Mituaki Huzii, Jack Kiefer, Mark Knowles, Sue Leurgans, Alex 
McMillan, Masashi No, Ingram Olkin, Kartik Patel, Michael Perlman, Allen 
Sampson, Ashis Sen Gupta, Andrew Siegel, Charles Stein, Patrick Strout, 
Akimichi Takemura, Joe Verducci, Marlos Viana, and Y. Yajima. I was 
helped in preparing the manuscript by Dorothy Anderson, Alice Lundin, 
Amy Schwartz, and Pat Struse. Special thanks go to Johanne Thiffault and 
George P. H. Styan for their precise attention. Support was contributed by 
the Army Research Office, the National Science Foundation, the Office of 
Naval Research, and IBM Systems Research Institute. 

Seven tables of significance points are given in Appendix B to facilitate 
carrying out test procedures. Tables 1, 5, and 7 are Tables 47, 50, and 53, 
respectively, of Biometrika Tables for Statisticians, Vol. 2, by E. S. Pearson 
and H. O. Hartley; permission of the Biometrika Trustees is hereby acknowl- 
edged. Table 2 is made up from three tables prepared by A. W. Davis and 
published in Biometrika (1970a), Annals of the Institute of Statistical Mathematics
(1970b), and Communications in Statistics, B. Simulation and Computation
(1980). Tables 3 and 4 are Tables 6.3 and 6.4, respectively, of Concise
Statistical Tables, edited by Ziro Yamauti (1977) and published by the
Japanese Standards Association; this book is a concise version of Statistical
Tables and Formulas with Computer Applications, JSA-1972. Table 6 is Table 3
of The Distribution of the Sphericity Test Criterion, ARL 72-0154, by B. N.
Nagarsenker and K. C. S. Pillai, Aerospace Research Laboratories (1972).
The author is indebted to the authors and publishers listed above for 
permission to reproduce these tables. 


T. W. ANDERSON 


Stanford, California
June 1984 


Preface to the First Edition 


This book has been designed primarily as a text for a two-semester course in 
multivariate statistics. It is hoped that the book will also serve as an
introduction to many topics in this area to statisticians who are not students 
and will be used as a reference by other statisticians. 

For several years the book in the form of dittoed notes has been used in a 
two-semester sequence of graduate courses at Columbia University; the first 
six chapters constituted the text for the first semester, emphasizing correla- 
tion theory. It is assumed that the reader is familiar with the usual theory of 
univariate statistics, particularly methods based on the univariate normal 
distribution. A knowledge of matrix algebra is also a prerequisite; however, 
an appendix on this topic has been included. 

It is hoped that the more basic and important topics are treated here, 
though to some extent the coverage is a matter of taste. Some of the more 
recent and advanced developments are only briefly touched on in the last
chapter.

The method of maximum likelihood is used to a large extent. This leads to 
reasonable procedures; in some cases it can be proved that they are optimal. 
In many situations, however, the theory of desirable or optimum procedures 
is lacking. 

Over the years that this manuscript has been developed, a number of students
and colleagues have been of considerable assistance. Allan Birnbaum, Harold 
Hotelling, Jacob Horowitz, Howard Levene, Ingram Olkin, Gobind Seth, 
Charles Stein, and Henry Teicher are to be mentioned particularly. Acknowledgements
are also due to other members of the Graduate Mathematical
Statistics Society at Columbia University for aid in the preparation of the
manuscript in dittoed form. The preparation of this manuscript was supported
in part by the Office of Naval Research.


T. W. ANDERSON 


Center for Advanced Study
in the Behavioral Sciences
Stanford, California 
December 1957 


CHAPTER 1 


Introduction 


1.1. MULTIVARIATE STATISTICAL ANALYSIS 


Multivariate statistical analysis is concerned with data that consist of sets of 
measurements on a number of individuals or objects. The sample data may 
be heights and weights of some individuals drawn randomly from a popula- 
tion of school children in a given city, or the statistical treatment may be 
made on a collection of measurements, such as lengths and widths of petals 
and lengths and widths of sepals of iris plants taken from two species, or one 
may study the scores on batteries of mental tests administered to a number of 
students. 

The measurements made on a single individual can be assembled into a
column vector. We think of the entire vector as an observation from a
multivariate population or distribution. When the individual is drawn randomly,
we consider the vector as a random vector with a distribution or
probability law describing that population. The set of observations on all
individuals in a sample constitutes a sample of vectors, and the vectors set
side by side make up the matrix of observations.† The data to be analyzed
then are thought of as displayed in a matrix or in several matrices.

We shall see that it is helpful in visualizing the data and understanding the
methods to think of each observation vector as constituting a point in a
Euclidean space, each coordinate corresponding to a measurement or variable.
Indeed, an early step in the statistical analysis is plotting the data; since
most statisticians are limited to two-dimensional plots, two coordinates of the
observation are plotted in turn.

†When data are listed on paper by individual, it is natural to print the measurements on one
individual as a row of the table; then one individual corresponds to a row vector. Since we prefer
to operate algebraically with column vectors, we have chosen to treat observations in terms of
column vectors. (In practice, the basic data set may well be on cards, tapes, or disks.)

Characteristics of a univariate distribution of essential interest are the 
mean as a measure of location and the standard deviation as a measure of 
variability; similarly the mean and standard deviation of a univariate sample 
are important summary measures. In multivariate analysis, the means and
variances of the separate measurements, for distributions and for samples,
have corresponding relevance. An essential aspect, however, of multivariate
analysis is the dependence between the different variables. The dependence
between two variables may involve the covariance between them, that
is, the average products of their deviations from their respective means. The
covariance standardized by the corresponding standard deviations is the
correlation coefficient; it serves as a measure of degree of dependence. A set
of summary statistics is the mean vector (consisting of the univariate means)
and the covariance matrix (consisting of the univariate variances and bivariate
covariances). An alternative set of summary statistics with the same
information is the mean vector, the set of standard deviations, and the 
correlation matrix. Similar parameter quantities describe location, variability, 
and dependence in the population or for a probability distribution. The 
multivariate normal distribution is completely determined by its mean vector 
and covariance matrix, and the sample mean vector and covariance matrix 
constitute a sufficient set of statistics. 
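
As an illustration of the two equivalent sets of summary statistics described above, the following minimal sketch computes them for simulated data. The data, the sample size, and the divisor N - 1 in the sample covariance are assumptions made for the example; none of them comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 200 individuals, p = 3 measurements each.
# Each column of X is one observation vector (the text's convention).
N, p = 200, 3
X = rng.normal(size=(p, N))
X[1] += 0.8 * X[0]  # make the second variable depend on the first

# Mean vector: the p univariate sample means.
mean_vector = X.mean(axis=1)

# Covariance matrix: average products of deviations from the means
# (divisor N - 1 is an assumption of this sketch).
deviations = X - mean_vector[:, None]
cov_matrix = deviations @ deviations.T / (N - 1)

# Standard deviations and correlation matrix.
std_devs = np.sqrt(np.diag(cov_matrix))
corr_matrix = cov_matrix / np.outer(std_devs, std_devs)

# The alternative set carries the same information: the covariance
# matrix is recovered from the standard deviations and correlations.
rebuilt = corr_matrix * np.outer(std_devs, std_devs)
```

The last line shows why the mean vector, the standard deviations, and the correlation matrix lose nothing relative to the mean vector and the covariance matrix: each covariance is the product of a correlation and two standard deviations.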

The measurement and analysis of dependence between variables, between 
sets of variables, and between variables and sets of variables are fundamental 
to multivariate analysis. The multiple correlation coefficient is an extension 
of the notion of correlation to the relationship of one variable to a set of 
variables. The partial correlation coefficient is a measure of dependence 
between two variables when the effects of other correlated variables have 
been removed. The various correlation coefficients computed from samples 
are used to estimate corresponding correlation coefficients of distributions. 
In this book tests of hypotheses of independence are developed. The proper- 
ties of the estimators and test procedures are studied for sampling from the 
multivariate normal distribution. 

A number of statistical problems arising in multivariate populations are 
straightforward analogs of problems arising in univariate populations; the 
suitable methods for handling these problems are similarly related. For 
example, in the univariate case we may wish to test the hypothesis that the 
mean of a variable is zero; in the multivariate case we may wish to test the 
hypothesis that the vector of the means of several variables is the zero vector. 
The analog of the Student t-test for the first hypothesis is the generalized
$T^2$-test. The analysis of variance of a single variable is adapted to vector
observations; in regression analysis, the dependent quantity may be a vector
variable. A comparison of variances is generalized into a comparison of
covariance matrices.

The test procedures of univariate statistics are generalized to the multi- 
variate case in such ways that the dependence between variables is taken into 
account. These methods may not depend on the coordinate system; that is, 
the procedures may be invariant with respect to linear transformations that 
leave the null hypothesis invariant. In some problems there may be families 
of tests that are invariant; then choices must be made. Optimal properties of 
the tests are considered. 

For some other purposes, however, it may be important to select a 
coordinate system so that the variates have desired statistical properties. One 
might say that they involve characterizations of inherent properties of normal 
distributions and of samples. These are closely related to the algebraic 
problems of canonical forms of matrices. An example is finding the normal- 
ized linear combination of variables with maximum or minimum variance 
(finding principal components); this amounts to finding a rotation of axes 
that carries the covariance matrix to diagonal form. Another example is 
characterizing the dependence between two sets of variates (finding canoni- 
cal correlations). These problems involve the characteristic roots and vectors 
of various matrices. The statistical properties of the corresponding sample 
quantities are treated. 
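
The principal component example above can be sketched numerically. The covariance matrix below is a hypothetical one chosen for illustration, and the use of `numpy.linalg.eigh` is a choice of this sketch rather than a procedure given in the text.

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix (symmetric, positive definite).
Sigma = np.array([[4.0, 2.0, 0.6],
                  [2.0, 3.0, 0.4],
                  [0.6, 0.4, 1.0]])

# eigh returns the characteristic roots in ascending order and
# orthonormal characteristic vectors as the columns of `vectors`.
roots, vectors = np.linalg.eigh(Sigma)

# Rotating the axes by the characteristic vectors carries the
# covariance matrix to diagonal form.
rotated = vectors.T @ Sigma @ vectors

# The normalized linear combination with maximum variance uses the
# vector belonging to the largest root; its variance is that root.
b_max = vectors[:, -1]
var_max = b_max @ Sigma @ b_max

# The combination with minimum variance belongs to the smallest root.
b_min = vectors[:, 0]
var_min = b_min @ Sigma @ b_min
```

Here `rotated` is diagonal with the characteristic roots on the diagonal, so the rotated variates are uncorrelated and their variances are the roots.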

Some statistical problems arise in models in which means and covariances 
are restricted. Factor analysis may be based on a model with a (population) 
covariance matrix that is the sum of a positive definite diagonal matrix and a 
positive semidefinite matrix of low rank; linear structural relationships may 
have a similar formulation. The simultaneous equations system of economet- 
rics is another example of a special model. 


1.2. THE MULTIVARIATE NORMAL DISTRIBUTION 


The statistical methods treated in this book can be developed and evaluated
in the context of the multivariate normal distribution, though many of the
procedures are useful and effective when the distribution sampled is not
normal. A major reason for basing statistical analysis on the normal distribution
is that this probabilistic model approximates well the distribution of
continuous measurements in many sampled populations. In fact, most of the 
methods and theory have been developed to serve statistical analysis of data. 
Mathematicians such as Adrian (1808), Laplace (1811), Plana (1813), Gauss
(1823), and Bravais (1846) studied the bivariate normal density. Francis
Galton, the geneticist, introduced the ideas of correlation, regression, and
homoscedasticity in the study of pairs of measurements, one made on a
parent and one in an offspring. [See, e.g., Galton (1889).] He enunciated the
theory of the multivariate normal distribution as a generalization of observed
properties of samples.

Karl Pearson and others carried on the development of the theory and use
of different kinds of correlation coefficients† for studying problems in genetics,
biology, and other fields. R. A. Fisher further developed methods for
agriculture, botany, and anthropology, including the discriminant function for
classification problems. In another direction, analysis of scores on mental
tests led to a theory, including factor analysis, the sampling theory of which is 
based on the normal distribution. In these cases, as well as in agricultural 
experiments, in engineering problems, in certain economic problems, and in 
other fields, the multivariate normal distributions have been found to be 
sufficiently close approximations to the populations so that statistical analy- 
ses based on these models are justified. 

The univariate normal distribution arises frequently because the effect
studied is the sum of many independent random effects. Similarly, the
multivariate normal distribution often occurs because the multiple measurements
are sums of small independent effects. Just as the central limit
theorem leads to the univariate normal distribution for single variables, so
does the general central limit theorem for several variables lead to the
multivariate normal distribution.

Statistical theory based on the normal distribution has the advantage that
the multivariate methods based on it are extensively developed and can be
studied in an organized and systematic way. This is due not only to the need
for such methods because they are of practical use, but also to the fact that
normal theory is amenable to exact mathematical treatment. The suitable
methods of analysis are mainly based on standard operations of matrix
algebra; the distributions of many statistics involved can be obtained exactly
or at least characterized; and in many cases optimum properties of procedures
can be deduced.

The point of view in this book is to state problems of inference in terms of 
the multivariate normal distributions, develop efficient and often optimum 
methods in this context, and evaluate significance and confidence levels in 
these terms. This approach gives coherence and rigor to the exposition, but, 
by its very nature, cannot exhaust consideration of multivariate statistical
analysis. The procedures are appropriate to many nonnormal distributions,
but their adequacy may be open to question. Roughly speaking, inferences
about means are robust because of the operation of the central limit
theorem, but inferences about covariances are sensitive to normality, the
variability of sample covariances depending on fourth-order moments.

†For a detailed study of the development of the ideas of correlation, see Walker (1931).

This inflexibility of normal methods with respect to moments of order 
greater than two can be reduced by including a larger class of elliptically 
contoured distributions. In the univariate case the normal distribution is 
determined by the mean and variance; higher-order moments and properties 
such as peakedness and long tails are functions of the mean and variance. 
Similarly, in the multivariate case the means and covariances or the means, 
variances, and correlations determine all of the properties of the distribution. 
That limitation is alleviated in one respect by consideration of a broad class 
of elliptically contoured distributions. That class maintains the dependence 
structure, but permits more general peakedness and long tails. This study 
leads to more robust methods. 

The development of computer technology has revolutionized multivariate 
statistics in several respects. As in univariate statistics, modern computers 
permit the evaluation of observed variability and significance of results by 
resampling methods, such as the bootstrap and cross-validation. Such
methodology reduces the reliance on tables of significance points as well as 
eliminates some restrictions of the normal distribution. 

Nonparametric techniques are available when nothing is known about the 
underlying distributions. Space does not permit inclusion of these topics as 
well as other considerations of data analysis, such as treatment of outliers
and transformations of variables to approximate normality and homoscedasticity.

The availability of modern computer facilities makes possible the analysis 
of large data sets and that ability permits the application of multivariate 
methods to new areas, such as image analysis, and more effective analysis of 
data, such as meteorological. Moreover, new problems of statistical analysis 
arise, such as sparseness of parameter or data matrices. Because hardware 
and software development is so explosive and programs require specialized 
knowledge, we are content to make a few remarks here and there about 
computation. Packages of statistical programs are available for most of the 
methods. 


CHAPTER 2 


The Multivariate 
Normal Distribution 


2.1. INTRODUCTION 


In this chapter we discuss the multivariate normal distribution and some of 
its properties. In Section 2.2 are considered the fundamental notions of 
multivariate distributions: the definition by means of multivariate density 
functions, marginal distributions, conditional distributions, expected values, 
and moments. In Section 2.3 the multivariate normal distribution is defined; 
the parameters are shown to be the means, variances, and covariances or the 
means, variances, and correlations of the components of the random vector. 
In Section 2.4 it is shown that linear combinations of normal variables are 
normally distributed and hence that marginal distributions are normal. In
Section 2.5 we see that conditional distributions are also normal with means
that are linear functions of the conditioning variables; the coefficients are
regression coefficients. The variances, covariances, and correlations (called
partial correlations) are constants. The multiple correlation coefficient is
the maximum correlation between a scalar random variable and a linear
combination of other random variables; it is a measure of association between
one variable and a set of others. The fact that marginal and conditional
distributions of normal distributions are normal makes the treatment
of this family of distributions coherent. In Section 2.6 the characteristic 
function, moments, and cumulants are discussed. In Section 2.7 elliptically 
contoured distributions are defined; the properties of the normal distribution 
are extended to this larger class of distributions. 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson


2.2. NOTIONS OF MULTIVARIATE DISTRIBUTIONS


2.2.1. Joint Distributions 


In this section we shall consider the notions of joint distributions of several 
variables, derived marginal distributions of subsets of variables, and derived 
conditional distributions. First consider the case of two (real) random
variables† X and Y. Probabilities of events defined in terms of these variables
can be obtained by operations involving the cumulative distribution function
(abbreviated as cdf),


(1)  $F(x, y) = \Pr\{X \le x,\ Y \le y\},$


defined for every pair of real numbers (x, y). We are interested in cases 
where F(x, y) is absolutely continuous; this means that the following partial 
derivative exists almost everywhere: 


(2)  $\dfrac{\partial^2 F(x, y)}{\partial x\, \partial y} = f(x, y),$

and

(3)  $F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\, du\, dv.$


The nonnegative function f(x, y) is called the density of X and Y. The pair
of random variables (X, Y) defines a random point in a plane. The probability
that (X, Y) falls in a rectangle is

(4)  $\Pr\{x \le X \le x + \Delta x,\ y \le Y \le y + \Delta y\}$
       $= F(x + \Delta x, y + \Delta y) - F(x + \Delta x, y) - F(x, y + \Delta y) + F(x, y)$
       $= \int_{y}^{y + \Delta y} \int_{x}^{x + \Delta x} f(u, v)\, du\, dv$

$(\Delta x > 0,\ \Delta y > 0)$. The probability of the random point (X, Y) falling in any
set E for which the following integral is defined (that is, any measurable set
E) is


(5)  $\Pr\{(X, Y) \in E\} = \iint_{E} f(x, y)\, dx\, dy.$


†In Chapter 2 we shall distinguish between random variables and running variables by use of
capital and lowercase letters, respectively. In later chapters we may be unable to hold to this
convention because of other complications of notation.

This follows from the definition of the integral [as the limit of sums of the
sort (4)]. If f(x, y) is continuous in both variables, the probability element
$f(x, y)\, \Delta y\, \Delta x$ is approximately the probability that X falls between x and
$x + \Delta x$ and Y falls between y and $y + \Delta y$ since


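(6)  $\Pr\{x \le X \le x + \Delta x,\ y \le Y \le y + \Delta y\} = \int_{y}^{y + \Delta y} \int_{x}^{x + \Delta x} f(u, v)\, du\, dv$
       $= f(x_0, y_0)\, \Delta x\, \Delta y$


for some $x_0, y_0$ ($x \le x_0 \le x + \Delta x$, $y \le y_0 \le y + \Delta y$) by the mean value theorem
of calculus. Since f(u, v) is continuous, (6) is approximately $f(x, y)\, \Delta x\, \Delta y$.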
In fact, 


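(7)  $\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \dfrac{1}{\Delta x\, \Delta y} \left| \Pr\{x \le X \le x + \Delta x,\ y \le Y \le y + \Delta y\} - f(x, y)\, \Delta x\, \Delta y \right| = 0.$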
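
Formulas (4), (6), and (7) can be checked numerically. The sketch below assumes, purely for illustration, that X and Y are independent standard normal variables, so that the cdf is a product of univariate normal cdfs; the function names are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def F(x, y):
    """Assumed example cdf: X and Y independent standard normal."""
    return Phi(x) * Phi(y)

def f(x, y):
    """Density belonging to F."""
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

def rect_prob(x, y, dx, dy):
    """Formula (4): probability of the rectangle from four cdf values."""
    return F(x + dx, y + dy) - F(x + dx, y) - F(x, y + dy) + F(x, y)

x, y = 0.3, -0.5
# Ratio of the rectangle probability to the probability element
# f(x, y) dx dy; by (7) it tends to 1 as the sides shrink, since
# f(x, y) > 0 at this point.
ratios = {d: rect_prob(x, y, d, d) / (f(x, y) * d * d)
          for d in (0.1, 0.01, 0.001)}
```

The entries of `ratios` move toward 1 as the side length decreases, which is the content of (7) at a point where the density is positive.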


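Now we consider the case of p random variables $X_1, X_2, \ldots, X_p$. The
cdf is

(8)  $F(x_1, \ldots, x_p) = \Pr\{X_1 \le x_1, \ldots, X_p \le x_p\},$

defined for every set of real numbers $x_1, \ldots, x_p$. The density function, if
$F(x_1, \ldots, x_p)$ is absolutely continuous, is

(9)  $\dfrac{\partial^p F(x_1, \ldots, x_p)}{\partial x_1 \cdots \partial x_p} = f(x_1, \ldots, x_p)$

(almost everywhere), and

(10)  $F(x_1, \ldots, x_p) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p)\, du_1 \cdots du_p.$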


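The probability of falling in any (measurable) set R in the p-dimensional
Euclidean space is

(11)  $\Pr\{(X_1, \ldots, X_p) \in R\} = \int \cdots \int_{R} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$

The probability element $f(x_1, \ldots, x_p)\, \Delta x_1 \cdots \Delta x_p$ is approximately the probability
$\Pr\{x_1 \le X_1 \le x_1 + \Delta x_1, \ldots, x_p \le X_p \le x_p + \Delta x_p\}$ if $f(x_1, \ldots, x_p)$ is
continuous. The joint moments are defined as†

(12)  $\mathcal{E} X_1^{h_1} \cdots X_p^{h_p} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_p^{h_p} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$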


2.2.2. Marginal Distributions 


Given the cdf of two random variables X, Y as being F(x, y), the marginal
cdf of X is

(13)  $\Pr\{X \le x\} = \Pr\{X \le x,\ Y \le \infty\} = F(x, \infty).$

Let this be F(x). Clearly

(14)  $F(x) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, v)\, dv\, du.$

We call

(15)  $\int_{-\infty}^{\infty} f(u, v)\, dv = f(u),$

say, the marginal density of X. Then (14) is

(16)  $F(x) = \int_{-\infty}^{x} f(u)\, du.$


In a similar fashion we define G(y), the marginal cdf of Y, and g(y), the 
marginal density of Y. 
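
To make (15) and (16) concrete, the following sketch integrates a joint density numerically. The density f(x, y) = x + y on the unit square is an assumed example, not one from the text, and the midpoint rule is simply a convenient choice; for this density the marginal density of X is x + 1/2 and the marginal cdf is x²/2 + x/2 on [0, 1].

```python
import numpy as np

def joint_density(x, y):
    """Assumed example: f(x, y) = x + y on the unit square, 0 elsewhere."""
    inside = (0 <= x) & (x <= 1) & (0 <= y) & (y <= 1)
    return np.where(inside, x + y, 0.0)

def midpoints(a, b, n):
    """Midpoints and width of n equal subintervals of [a, b]."""
    h = (b - a) / n
    return a + h * (np.arange(n) + 0.5), h

def marginal_density_x(u, n=2000):
    """Formula (15): integrate f(u, v) over v (here over the support [0, 1])."""
    v, h = midpoints(0.0, 1.0, n)
    return float(np.sum(joint_density(u, v)) * h)

def marginal_cdf_x(x, n=400):
    """Formula (16): integrate the marginal density up to x (for 0 < x <= 1)."""
    u, h = midpoints(0.0, x, n)
    return float(sum(marginal_density_x(ui) for ui in u) * h)
```

For example, `marginal_density_x(0.3)` agrees with 0.3 + 1/2 and `marginal_cdf_x(0.6)` with 0.6²/2 + 0.6/2, up to rounding, because the midpoint rule is exact for linear integrands.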

Now we turn to the general case. Given $F(x_1, \ldots, x_p)$ as the cdf of
$X_1, \ldots, X_p$, we wish to find the marginal cdf of some of $X_1, \ldots, X_p$, say, of
$X_1, \ldots, X_r$ ($r < p$). It is

(17)  $\Pr\{X_1 \le x_1, \ldots, X_r \le x_r\}$
        $= \Pr\{X_1 \le x_1, \ldots, X_r \le x_r,\ X_{r+1} \le \infty, \ldots, X_p \le \infty\}$
        $= F(x_1, \ldots, x_r, \infty, \ldots, \infty).$

The marginal density of $X_1, \ldots, X_r$ is

(18)  $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_r, u_{r+1}, \ldots, u_p)\, du_{r+1} \cdots du_p.$

The marginal distribution and density of any other subset of $X_1, \ldots, X_p$ are
obtained in the obviously similar fashion.

†$\mathcal{E}$ will be used to denote mathematical expectation.


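The joint moments of a subset of variates can be computed from the
marginal distribution; for example,

(19)  $\mathcal{E} X_1^{h_1} \cdots X_r^{h_r} = \mathcal{E} X_1^{h_1} \cdots X_r^{h_r} X_{r+1}^{0} \cdots X_p^{0}$
        $= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_r^{h_r} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$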


2.2.3. Statistical Independence 
Two random variables X, Y with cdf F(x, y) are said to be independent if 
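(20)  $F(x, y) = F(x)G(y),$

where F(x) is the marginal cdf of X and G(y) is the marginal cdf of Y. This
implies that the density of X, Y is

(21)  $f(x, y) = \dfrac{\partial^2 F(x, y)}{\partial x\, \partial y} = \dfrac{\partial^2 F(x)G(y)}{\partial x\, \partial y}$
        $= \dfrac{dF(x)}{dx}\, \dfrac{dG(y)}{dy}$
        $= f(x)g(y).$

Conversely, if f(x, y) = f(x)g(y), then

(22)  $F(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\, du\, dv = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u)g(v)\, du\, dv$
        $= \int_{-\infty}^{x} f(u)\, du \int_{-\infty}^{y} g(v)\, dv = F(x)G(y).$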


Thus an equivalent definition of independence in the case of densities
existing is that f(x, y) = f(x)g(y). To see the implications of statistical
independence, given any $x_1 < x_2$, $y_1 < y_2$, we consider the probability

(23)  $\Pr\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\}$
        $= \int_{y_1}^{y_2} \int_{x_1}^{x_2} f(u, v)\, du\, dv = \int_{x_1}^{x_2} f(u)\, du \int_{y_1}^{y_2} g(v)\, dv$
        $= \Pr\{x_1 \le X \le x_2\}\, \Pr\{y_1 \le Y \le y_2\}.$

The probability of X falling in a given interval and Y falling in a given
interval is the product of the probability of X falling in the interval and the 
probability of Y falling in the other interval. 
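
The factorization in (23) can be illustrated by comparing a density that factors with one that does not. Both densities below are assumed examples on the unit square, and the midpoint rule is a convenience of this sketch.

```python
import numpy as np

def rect_prob(density, x1, x2, y1, y2, n=400):
    """Midpoint-rule approximation of the integral of density over a rectangle."""
    hx, hy = (x2 - x1) / n, (y2 - y1) / n
    u = x1 + hx * (np.arange(n) + 0.5)
    v = y1 + hy * (np.arange(n) + 0.5)
    U, V = np.meshgrid(u, v, indexing="ij")
    return float(np.sum(density(U, V)) * hx * hy)

def independent(x, y):
    """Assumed example: 4xy = f(x) g(y) with f(x) = 2x and g(y) = 2y."""
    return 4.0 * x * y

def dependent(x, y):
    """Assumed example: x + y, which does not factor."""
    return x + y

def joint_and_product(density, x1, x2, y1, y2):
    """Return Pr{x1<=X<=x2, y1<=Y<=y2} and Pr{x1<=X<=x2} Pr{y1<=Y<=y2}."""
    joint = rect_prob(density, x1, x2, y1, y2)
    px = rect_prob(density, x1, x2, 0.0, 1.0)  # Y unrestricted on its support
    py = rect_prob(density, 0.0, 1.0, y1, y2)  # X unrestricted on its support
    return joint, px * py

joint_ind, prod_ind = joint_and_product(independent, 0.2, 0.5, 0.1, 0.7)
joint_dep, prod_dep = joint_and_product(dependent, 0.2, 0.5, 0.1, 0.7)
```

For the factoring density the two numbers coincide, as (23) asserts; for the density x + y they differ, so X and Y are not independent in that case.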


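If the cdf of $X_1, \ldots, X_p$ is $F(x_1, \ldots, x_p)$, the set of random variables is said
to be mutually independent if

(24)  $F(x_1, \ldots, x_p) = F_1(x_1) \cdots F_p(x_p),$

where $F_i(x_i)$ is the marginal cdf of $X_i$, $i = 1, \ldots, p$. The set $X_1, \ldots, X_r$ is said
to be independent of the set $X_{r+1}, \ldots, X_p$ if

(25)  $F(x_1, \ldots, x_p) = F(x_1, \ldots, x_r, \infty, \ldots, \infty) \cdot F(\infty, \ldots, \infty, x_{r+1}, \ldots, x_p).$

One result of independence is that joint moments factor. For example, if
$X_1, \ldots, X_p$ are mutually independent, then

(26)  $\mathcal{E} X_1^{h_1} \cdots X_p^{h_p} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{h_1} \cdots x_p^{h_p} f_1(x_1) \cdots f_p(x_p)\, dx_1 \cdots dx_p$
        $= \prod_{i=1}^{p} \int_{-\infty}^{\infty} x_i^{h_i} f_i(x_i)\, dx_i$
        $= \prod_{i=1}^{p} \mathcal{E} X_i^{h_i}.$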


2.2.4. Conditional Distributions 


If A and B are two events such that the probability of A and B occurring 
simultaneously is P(AB) and the probability of B occurring is P(B) > 0,
then the conditional probability of A occurring given that B has occurred is
P(AB)/P(B). Suppose the event A is X falling in the interval $[x_1, x_2]$ and
the event B is Y falling in $[y_1, y_2]$. Then the conditional probability that X
falls in $[x_1, x_2]$, given that Y falls in $[y_1, y_2]$, is

(27)  $\Pr\{x_1 \le X \le x_2 \mid y_1 \le Y \le y_2\} = \dfrac{\Pr\{x_1 \le X \le x_2,\ y_1 \le Y \le y_2\}}{\Pr\{y_1 \le Y \le y_2\}}.$

Now let $y_1 = y$, $y_2 = y + \Delta y$. Then for a continuous density,


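(28)  $\int_{y}^{y + \Delta y} g(v)\, dv = g(y^*)\, \Delta y,$

where $y \le y^* \le y + \Delta y$. Also

(29)  $\int_{y}^{y + \Delta y} f(u, v)\, dv = f[u, y^*(u)]\, \Delta y,$

where $y \le y^*(u) \le y + \Delta y$. Therefore,

(30)  $\Pr\{x_1 \le X \le x_2 \mid y \le Y \le y + \Delta y\} = \int_{x_1}^{x_2} \dfrac{f[u, y^*(u)]}{g(y^*)}\, du.$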


It will be noticed that for fixed y and $\Delta y$ (> 0), the integrand of (30) behaves
as a univariate density function. Now for y such that g(y) > 0, we define
$\Pr\{x_1 \le X \le x_2 \mid Y = y\}$, the probability that X lies between $x_1$ and $x_2$, given
that Y is y, as the limit of (30) as $\Delta y \to 0$. Thus

(31)  $\Pr\{x_1 \le X \le x_2 \mid Y = y\} = \int_{x_1}^{x_2} f(u \mid y)\, du,$

where $f(u \mid y) = f(u, y)/g(y)$. For given y, $f(u \mid y)$ is a density function and is
called the conditional density of X given y. We note that if X and Y are
independent, $f(x \mid y) = f(x)$.
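
The passage from (30) to (31) can be followed on a small example. The sketch assumes the density f(x, y) = x + y on the unit square (not an example from the text), for which g(y) = y + 1/2 and f(u|y) = (u + y)/(y + 1/2); the integrals are written in closed form.

```python
def joint_rect(x1, x2, y1, y2):
    """Integral of the assumed density f(x, y) = x + y over a rectangle
    inside the unit square."""
    return (0.5 * (x2**2 - x1**2) * (y2 - y1)
            + 0.5 * (y2**2 - y1**2) * (x2 - x1))

def prob_y(y1, y2):
    """Pr{y1 <= Y <= y2}: X ranges over its whole support [0, 1]."""
    return joint_rect(0.0, 1.0, y1, y2)

def cond_prob_interval(x1, x2, y, dy):
    """Conditional probability of x1 <= X <= x2 given y <= Y <= y + dy, as in (27)."""
    return joint_rect(x1, x2, y, y + dy) / prob_y(y, y + dy)

def cond_prob_point(x1, x2, y):
    """Formula (31): integral of f(u|y) = (u + y)/(y + 1/2) over [x1, x2]."""
    return (0.5 * (x2**2 - x1**2) + y * (x2 - x1)) / (y + 0.5)

# Conditioning on a shrinking interval for Y approaches conditioning on Y = y.
approximations = [cond_prob_interval(0.2, 0.5, 0.4, dy) for dy in (0.1, 0.01, 1e-6)]
limit = cond_prob_point(0.2, 0.5, 0.4)
```

The values in `approximations` approach `limit` as Δy decreases, which is how (31) is defined as the limit of (30).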

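In the general case of $X_1, \ldots, X_p$ with cdf $F(x_1, \ldots, x_p)$, the conditional
density of $X_1, \ldots, X_r$, given $X_{r+1} = x_{r+1}, \ldots, X_p = x_p$, is

(32)  $\dfrac{f(x_1, \ldots, x_p)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(u_1, \ldots, u_r, x_{r+1}, \ldots, x_p)\, du_1 \cdots du_r}.$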


For a more general discussion of conditional probabilities, the reader is 
referred to Chung (1974), Kolmogorov (1950), Loève (1977), (1978), and
Neveu (1965). 


2.2.5. Transformation of Variables


Let the density of $X_1,\dots,X_p$ be $f(x_1,\dots,x_p)$. Consider the p real-valued functions

(33)  $y_i = y_i(x_1,\dots,x_p), \qquad i = 1,\dots,p.$

We assume that the transformation from the x-space to the y-space is one-to-one;† the inverse transformation is

(34)  $x_i = x_i(y_1,\dots,y_p), \qquad i = 1,\dots,p.$

† More precisely, we assume this is true for the part of the x-space for which $f(x_1,\dots,x_p)$ is positive.

Let the random variables $Y_1,\dots,Y_p$ be defined by

(35)  $Y_i = y_i(X_1,\dots,X_p), \qquad i = 1,\dots,p.$

Then the density of $Y_1,\dots,Y_p$ is

(36)  $g(y_1,\dots,y_p) = f[x_1(y_1,\dots,y_p),\dots,x_p(y_1,\dots,y_p)]\,J(y_1,\dots,y_p),$

where $J(y_1,\dots,y_p)$ is the Jacobian

(37)  $J(y_1,\dots,y_p) = \operatorname{mod}\begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} & \cdots & \dfrac{\partial x_1}{\partial y_p} \\ \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_2}{\partial y_p} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial x_p}{\partial y_1} & \dfrac{\partial x_p}{\partial y_2} & \cdots & \dfrac{\partial x_p}{\partial y_p} \end{vmatrix}.$

We assume the derivatives exist, and "mod" means modulus or absolute value of the expression following it. The probability that $(X_1,\dots,X_p)$ falls in a region R is given by (11); the probability that $(Y_1,\dots,Y_p)$ falls in a region S is

(38)  $\Pr\{(Y_1,\dots,Y_p) \in S\} = \int\cdots\int_{S} g(y_1,\dots,y_p)\,dy_1\cdots dy_p.$

If S is the transform of R, that is, if each point of R transforms by (33) into a point of S and if each point of S transforms into R by (34), then (11) is equal to (38) by the usual theory of transformation of multiple integrals. From this follows the assertion that (36) is the density of $Y_1,\dots,Y_p$.
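
The rule (36) can be checked numerically in the simplest case p = 1. The sketch below is only an illustration under assumptions that are not in the text: X is taken to be standard normal, the transformation is y = exp(x), and the function names are hypothetical. It compares the probability of an interval computed from g with the probability of the corresponding interval computed in the x-space.

```python
import math

def std_normal_pdf(x):
    """Density f(x) of a standard normal variable (assumed example density)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Distribution function of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def density_of_y(y):
    """g(y) = f[x(y)] * J(y) for y = exp(x): x(y) = log y and J(y) = |dx/dy| = 1/y."""
    return std_normal_pdf(math.log(y)) * abs(1.0 / y)

def integrate(func, a, b, n=20000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for k in range(1, n):
        total += func(a + k * h)
    return total * h

a, b = 0.5, 3.0
# Probability that Y falls in S = [a, b], computed in the y-space as in (38).
prob_via_g = integrate(density_of_y, a, b)
# Probability that X falls in the corresponding region R = [log a, log b].
prob_via_f = std_normal_cdf(math.log(b)) - std_normal_cdf(math.log(a))
print(prob_via_g, prob_via_f)
```

The two printed numbers should agree up to the error of the numerical integration, which is the one-dimensional version of the statement that (11) equals (38).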


2.3. THE MULTIVARIATE NORMAL DISTRIBUTION 


The univariate normal density function can be written

(1)  $k e^{-\frac12 \alpha (x-\beta)^2} = k e^{-\frac12 (x-\beta)\alpha(x-\beta)},$

where $\alpha$ is positive and $k$ is chosen so that the integral of (1) over the entire x-axis is unity. The density function of a multivariate normal distribution of $X_1,\dots,X_p$ has an analogous form. The scalar variable $x$ is replaced by a vector

(2)  $x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix},$

the scalar constant $\beta$ is replaced by a vector

(3)  $b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix},$

and the positive constant $\alpha$ is replaced by a positive definite (symmetric) matrix

(4)  $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{pmatrix}.$

The square $\alpha(x-\beta)^2 = (x-\beta)\alpha(x-\beta)$ is replaced by the quadratic form

(5)  $(x-b)'A(x-b) = \sum_{i,j=1}^{p} a_{ij}(x_i - b_i)(x_j - b_j).$

Thus the density function of a p-variate normal distribution is

(6)  $f(x_1,\dots,x_p) = K e^{-\frac12 (x-b)'A(x-b)},$

where $K$ (> 0) is chosen so that the integral over the entire p-dimensional Euclidean space of $x_1,\dots,x_p$ is unity.

Written in matrix notation, the similarity of the multivariate normal density (6) to the univariate density (1) is clear. Throughout this book we shall use matrix notation and operations. The reader is referred to the Appendix for a review of matrix theory and for definitions of our notation for matrix operations.

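We observe that $f(x_1,\dots,x_p)$ is nonnegative. Since $A$ is positive definite,

(7)  $(x-b)'A(x-b) \ge 0,$

and therefore the density is bounded; that is,

(8)  $f(x_1,\dots,x_p) \le K.$

Now let us determine $K$ so that the integral of (6) over the p-dimensional space is one. We shall evaluate

(9)  $K^* = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-\frac12 (x-b)'A(x-b)}\,dx_p\cdots dx_1.$

We use the fact (see Corollary A.1.6 in the Appendix) that if $A$ is positive definite, there exists a nonsingular matrix $C$ such that

(10)  $C'AC = I,$

where $I$ denotes the identity and $C'$ the transpose of $C$. Let

(11)  $x - b = Cy,$

where

(12)  $y = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}.$

Then

(13)  $(x-b)'A(x-b) = y'C'ACy = y'y.$

The Jacobian of the transformation is

(14)  $J = \operatorname{mod}|C|,$

where $\operatorname{mod}|C|$ indicates the absolute value of the determinant of $C$. Thus (9) becomes

(15)  $K^* = \operatorname{mod}|C| \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-\frac12 y'y}\,dy_p\cdots dy_1.$

We have

(16)  $e^{-\frac12 y'y} = \exp\left(-\tfrac12 \sum_{i=1}^{p} y_i^2\right) = \prod_{i=1}^{p} e^{-\frac12 y_i^2},$

where $\exp(z) = e^z$. We can write (15) as

(17)
$$\begin{aligned}
K^* &= \operatorname{mod}|C| \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-\frac12 y_1^2}\cdots e^{-\frac12 y_p^2}\,dy_p\cdots dy_1 \\
&= \operatorname{mod}|C| \prod_{i=1}^{p}\left\{\int_{-\infty}^{\infty} e^{-\frac12 y_i^2}\,dy_i\right\} \\
&= \operatorname{mod}|C| \prod_{i=1}^{p}\left\{\sqrt{2\pi}\right\} \\
&= \operatorname{mod}|C|\,(2\pi)^{\frac12 p}
\end{aligned}$$

by virtue of

(18)  $\dfrac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac12 t^2}\,dt = 1.$

Corresponding to (10) is the determinantal equation

(19)  $|C'|\cdot|A|\cdot|C| = |I|.$

Since

(20)  $|C'| = |C|,$

and since $|I| = 1$, we deduce from (19) that

(21)  $\operatorname{mod}|C| = 1/\sqrt{|A|}.$

Thus

(22)  $K = 1/K^* = \sqrt{|A|}\,(2\pi)^{-\frac12 p}.$

The normal density function is

(23)  $\dfrac{\sqrt{|A|}}{(2\pi)^{\frac12 p}}\, e^{-\frac12 (x-b)'A(x-b)}.$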
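
As a numerical sanity check on the constant in (22), the following sketch evaluates (23) for p = 2 and integrates it over a large square by the midpoint rule. The matrix `A`, the vector `b`, the grid size, and the function names are illustrative assumptions rather than anything taken from the text.

```python
import math

def normal_density(x, b, A):
    """Evaluate (23) for p = 2: sqrt|A| / (2*pi) * exp(-(x-b)'A(x-b)/2)."""
    d0, d1 = x[0] - b[0], x[1] - b[1]
    # Quadratic form (5) for a symmetric 2 x 2 matrix A.
    quad = A[0][0] * d0 * d0 + 2.0 * A[0][1] * d0 * d1 + A[1][1] * d1 * d1
    det_a = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return math.sqrt(det_a) / (2.0 * math.pi) * math.exp(-0.5 * quad)

def integrate_over_square(b, A, half_width=8.0, n=400):
    """Midpoint-rule integral of the density over a square centred at b."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        x0 = b[0] - half_width + (i + 0.5) * h
        for j in range(n):
            x1 = b[1] - half_width + (j + 0.5) * h
            total += normal_density((x0, x1), b, A)
    return total * h * h

A = [[2.0, 0.6], [0.6, 1.0]]   # symmetric and positive definite (assumed values)
b = [1.0, -2.0]
total_mass = integrate_over_square(b, A)
print(total_mass)
```

With this choice of $K$ the computed total mass should be very close to one; replacing $\sqrt{|A|}$ by any other constant would scale the result away from one.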


We shall now show the significance of $b$ and $A$ by finding the first and second moments of $X_1,\dots,X_p$. It will be convenient to consider these random variables as constituting a random vector

(24)  $X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.$

We shall define generally a random matrix and the expected value of a random matrix; a random vector is considered as a special case of a random matrix with one column.

Definition 2.3.1. A random matrix $Z$ is a matrix

(25)  $Z = (Z_{gh}), \qquad g = 1,\dots,m, \quad h = 1,\dots,n,$

of random variables $Z_{11},\dots,Z_{mn}$.




If the random variables $Z_{11},\dots,Z_{mn}$ can take on only a finite number of values, the random matrix $Z$ can be one of a finite number of matrices, say $Z(1),\dots,Z(q)$. If the probability of $Z = Z(i)$ is $p_i$, then we should like to define $\mathscr{E}Z$ as $\sum_{i=1}^{q} Z(i)\,p_i$. Then $\mathscr{E}Z = (\mathscr{E}Z_{gh})$. If the random variables $Z_{11},\dots,Z_{mn}$ have a joint density, then by operating with Riemann sums we can define $\mathscr{E}Z$ as the limit (if the limit exists) of approximating sums of the kind occurring in the discrete case; then again $\mathscr{E}Z = (\mathscr{E}Z_{gh})$. Therefore, in general we shall use the following definition:

Definition 2.3.2. The expected value of a random matrix $Z$ is

(26)  $\mathscr{E}Z = (\mathscr{E}Z_{gh}), \qquad g = 1,\dots,m, \quad h = 1,\dots,n.$

In particular if $Z$ is $X$ defined by (24), the expected value

(27)  $\mathscr{E}X = \begin{pmatrix} \mathscr{E}X_1 \\ \vdots \\ \mathscr{E}X_p \end{pmatrix}$

is the mean or mean vector of $X$. We shall usually denote this mean vector by $\mu$. If $Z$ is $(X-\mu)(X-\mu)'$, the expected value is

(28)  $\mathscr{C}(X) = \mathscr{E}(X-\mu)(X-\mu)' = [\mathscr{E}(X_i-\mu_i)(X_j-\mu_j)],$

the covariance or covariance matrix of $X$. The ith diagonal element of this matrix, $\mathscr{E}(X_i-\mu_i)^2$, is the variance of $X_i$, and the i, jth off-diagonal element, $\mathscr{E}(X_i-\mu_i)(X_j-\mu_j)$, is the covariance of $X_i$ and $X_j$, $i \ne j$. We shall usually denote the covariance matrix by $\Sigma$. Note that

(29)  $\mathscr{C}(X) = \mathscr{E}(XX' - \mu X' - X\mu' + \mu\mu') = \mathscr{E}XX' - \mu\mu'.$

The operation of taking the expected value of a random matrix (or vector) satisfies certain rules which we can summarize in the following lemma:

Lemma 2.3.1. If $Z$ is an $m \times n$ random matrix, $D$ is an $l \times m$ real matrix, $E$ is an $n \times q$ real matrix, and $F$ is an $l \times q$ real matrix, then

(30)  $\mathscr{E}(DZE + F) = D(\mathscr{E}Z)E + F.$

Proof. The element in the ith row and jth column of $\mathscr{E}(DZE + F)$ is

(31)  $\mathscr{E}\left(\sum_{h,g} d_{ih} Z_{hg} e_{gj} + f_{ij}\right) = \sum_{h,g} d_{ih}\,(\mathscr{E}Z_{hg})\,e_{gj} + f_{ij},$

which is the element in the ith row and jth column of $D(\mathscr{E}Z)E + F$. ∎

Lemma 2.3.2. If $Y = DX + f$, where $X$ is a random vector, then

(32)  $\mathscr{E}Y = D\,\mathscr{E}X + f,$

(33)  $\mathscr{C}(Y) = D\,\mathscr{C}(X)\,D'.$

Proof. The first assertion follows directly from Lemma 2.3.1, and the second from

(34)
$$\begin{aligned}
\mathscr{C}(Y) &= \mathscr{E}(Y - \mathscr{E}Y)(Y - \mathscr{E}Y)' \\
&= \mathscr{E}[DX + f - (D\,\mathscr{E}X + f)][DX + f - (D\,\mathscr{E}X + f)]' \\
&= \mathscr{E}[D(X - \mathscr{E}X)][D(X - \mathscr{E}X)]' \\
&= \mathscr{E}[D(X - \mathscr{E}X)(X - \mathscr{E}X)'D'],
\end{aligned}$$

which yields the right-hand side of (33) by Lemma 2.3.1. ∎
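
Lemma 2.3.2 can be verified exactly for a random vector that takes only finitely many values, the discrete case used to motivate Definition 2.3.2. In the sketch below the support points, probabilities, matrix `D`, and vector `f` are all hypothetical; the moments of $Y = DX + f$ are computed once directly and once through (32) and (33).

```python
def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def transpose(P):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*P)]

def mean_and_cov(points, probs):
    """Mean vector and covariance matrix of a discrete random vector."""
    p = len(points[0])
    mean = [sum(w * x[i] for x, w in zip(points, probs)) for i in range(p)]
    cov = [[sum(w * (x[i] - mean[i]) * (x[j] - mean[j]) for x, w in zip(points, probs))
            for j in range(p)] for i in range(p)]
    return mean, cov

# Hypothetical discrete distribution of X (three components) and arbitrary D, f.
points = [[0.0, 1.0, 2.0], [1.0, -1.0, 0.5], [2.0, 0.0, -1.0], [-1.0, 3.0, 1.0]]
probs = [0.1, 0.2, 0.3, 0.4]
D = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]
f = [4.0, -2.0]

mean_x, cov_x = mean_and_cov(points, probs)

# Direct route: transform every support point, then take moments of Y = DX + f.
y_points = [[sum(D[i][k] * x[k] for k in range(3)) + f[i] for i in range(2)] for x in points]
mean_y, cov_y = mean_and_cov(y_points, probs)

# Route through the lemma: (32) for the mean and (33) for the covariance matrix.
mean_y_lemma = [sum(D[i][k] * mean_x[k] for k in range(3)) + f[i] for i in range(2)]
cov_y_lemma = matmul(matmul(D, cov_x), transpose(D))

print(mean_y, mean_y_lemma)
print(cov_y, cov_y_lemma)
```

Both routes should give the same mean vector and covariance matrix up to rounding; note that the vector `f` shifts the mean but does not enter the covariance.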


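When the transformation corresponds to (11), that is, $X = CY + b$, then $\mathscr{E}X = C\,\mathscr{E}Y + b$. By the transformation theory given in Section 2.2, the density of $Y$ is proportional to (16); that is, it is

(35)  $\prod_{j=1}^{p}\left\{\dfrac{1}{\sqrt{2\pi}}\, e^{-\frac12 y_j^2}\right\} = (2\pi)^{-\frac12 p}\, e^{-\frac12 y'y}.$

The expected value of the ith component of $Y$ is

(36)
$$\begin{aligned}
\mathscr{E}Y_i &= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y_i \prod_{j=1}^{p}\left\{\frac{1}{\sqrt{2\pi}}\, e^{-\frac12 y_j^2}\right\} dy_1\cdots dy_p \\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_i\, e^{-\frac12 y_i^2}\,dy_i \prod_{j \ne i}\left\{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac12 y_j^2}\,dy_j\right\} \\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_i\, e^{-\frac12 y_i^2}\,dy_i \\
&= 0.
\end{aligned}$$

The last equality follows because† $y_i e^{-\frac12 y_i^2}$ is an odd function of $y_i$. Thus $\mathscr{E}Y = 0$. Therefore, the mean of $X$, denoted by $\mu$, is

(37)  $\mu = \mathscr{E}X = b.$

† Alternatively, the last equality follows because the next to last expression is the expected value of a normally distributed variable with mean 0.

From (33) we see that $\mathscr{C}(X) = C(\mathscr{E}YY')C'$. The i, jth element of $\mathscr{E}YY'$ is

(38)  $\mathscr{E}Y_iY_j = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} y_i y_j \prod_{h=1}^{p}\left\{\dfrac{1}{\sqrt{2\pi}}\, e^{-\frac12 y_h^2}\right\} dy_1\cdots dy_p$

because the density of $Y$ is (35). If $i = j$, we have

(39)
$$\begin{aligned}
\mathscr{E}Y_i^2 &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_i^2\, e^{-\frac12 y_i^2}\,dy_i \prod_{h \ne i}\left\{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac12 y_h^2}\,dy_h\right\} \\
&= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_i^2\, e^{-\frac12 y_i^2}\,dy_i \\
&= 1.
\end{aligned}$$

The last equality follows because the next to last expression is the expected value of the square of a variable normally distributed with mean 0 and variance 1. If $i \ne j$, (38) becomes

(40)
$$\begin{aligned}
\mathscr{E}Y_iY_j &= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_i\, e^{-\frac12 y_i^2}\,dy_i \cdot \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} y_j\, e^{-\frac12 y_j^2}\,dy_j \prod_{h \ne i,j}\left\{\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac12 y_h^2}\,dy_h\right\} \\
&= 0, \qquad i \ne j,
\end{aligned}$$

since the first integration gives 0. We can summarize (39) and (40) as

(41)  $\mathscr{E}YY' = I.$

Thus

(42)  $\mathscr{E}(X-\mu)(X-\mu)' = CIC' = CC'.$

From (10) we obtain $A = (C')^{-1}C^{-1}$ by multiplication by $(C')^{-1}$ on the left and by $C^{-1}$ on the right. Taking inverses on both sides of the equality gives us

(43)  $CC' = A^{-1}.$

Thus, the covariance matrix of $X$ is

(44)  $\Sigma = \mathscr{E}(X-\mu)(X-\mu)' = A^{-1}.$

From (43) we see that $\Sigma$ is positive definite. Let us summarize these results.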


Theorem 2.3.1. If the density of a p-dimensional random vector $X$ is (23), then the expected value of $X$ is $b$ and the covariance matrix is $A^{-1}$. Conversely, given a vector $\mu$ and a positive definite matrix $\Sigma$, there is a multivariate normal density

(45)  $(2\pi)^{-\frac12 p}\,|\Sigma|^{-\frac12}\, e^{-\frac12 (x-\mu)'\Sigma^{-1}(x-\mu)}$

such that the expected value of the vector with this density is $\mu$ and the covariance matrix is $\Sigma$.
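
The matrix $C$ in (10) is only asserted to exist. One concrete choice, offered here as an assumption and not as the construction used in the Appendix, is the lower-triangular Cholesky factor of $\Sigma = A^{-1}$. The sketch below builds it for an illustrative 2 x 2 covariance matrix and checks both $CC' = \Sigma$, as in (43), and $C'AC = I$, as in (10).

```python
import math

def cholesky_2x2(S):
    """Lower-triangular C with C C' = S for a 2 x 2 positive definite S."""
    c11 = math.sqrt(S[0][0])
    c21 = S[1][0] / c11
    c22 = math.sqrt(S[1][1] - c21 * c21)
    return [[c11, 0.0], [c21, c22]]

def inverse_2x2(S):
    """Inverse of a nonsingular 2 x 2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

def matmul(P, Q):
    """Product of two 2 x 2 matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(P):
    """Transpose of a 2 x 2 matrix."""
    return [[P[0][0], P[1][0]], [P[0][1], P[1][1]]]

Sigma = [[4.0, 1.2], [1.2, 1.0]]             # illustrative covariance matrix
A = inverse_2x2(Sigma)                       # matrix in the exponent of (23)
C = cholesky_2x2(Sigma)                      # one possible choice of C

CCt = matmul(C, transpose(C))                # should reproduce Sigma, as in (43)
CtAC = matmul(matmul(transpose(C), A), C)    # should be the identity, as in (10)
print(CCt)
print(CtAC)
```

With such a $C$, a vector $Y$ of independent standard normal components gives $X = CY + b$ with mean $b$ and covariance matrix $\Sigma$, which is the representation used in the derivation above.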


We shall denote the density (45) as $n(x \mid \mu, \Sigma)$ and the distribution law as $N(\mu, \Sigma)$.

The ith diagonal element of the covariance matrix, $\sigma_{ii}$, is the variance of the ith component of $X$; we may sometimes denote this by $\sigma_i^2$. The correlation coefficient between $X_i$ and $X_j$ is defined as

(46)  $\rho_{ij} = \dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}} = \dfrac{\sigma_{ij}}{\sigma_i\sigma_j}.$

This measure of association is symmetric in $X_i$ and $X_j$: $\rho_{ij} = \rho_{ji}$. Since

(47)  $\begin{pmatrix} \sigma_{ii} & \sigma_{ij} \\ \sigma_{ji} & \sigma_{jj} \end{pmatrix} = \begin{pmatrix} \sigma_i^2 & \sigma_i\sigma_j\rho_{ij} \\ \sigma_i\sigma_j\rho_{ij} & \sigma_j^2 \end{pmatrix}$

is positive definite (Corollary A.1.3 of the Appendix), the determinant

(48)  $\begin{vmatrix} \sigma_i^2 & \sigma_i\sigma_j\rho_{ij} \\ \sigma_i\sigma_j\rho_{ij} & \sigma_j^2 \end{vmatrix} = \sigma_i^2\sigma_j^2\,(1 - \rho_{ij}^2)$

is positive. Therefore, $-1 < \rho_{ij} < 1$. (For singular distributions, see Section 2.4.) The multivariate normal density can be parametrized by the means $\mu_i$, $i = 1,\dots,p$, the variances $\sigma_i^2$, $i = 1,\dots,p$, and the correlations $\rho_{ij}$, $i < j$, $i,j = 1,\dots,p$.

As a special case of the preceding theory, we consider the bivariate normal distribution. The mean vector is

(49)  $\mathscr{E}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix};$

the covariance matrix may be written

(50)
$$\begin{aligned}
\Sigma &= \mathscr{E}\begin{pmatrix} (X_1-\mu_1)^2 & (X_1-\mu_1)(X_2-\mu_2) \\ (X_2-\mu_2)(X_1-\mu_1) & (X_2-\mu_2)^2 \end{pmatrix} \\
&= \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_2\sigma_1\rho & \sigma_2^2 \end{pmatrix},
\end{aligned}$$

where $\sigma_1^2$ is the variance of $X_1$, $\sigma_2^2$ the variance of $X_2$, and $\rho$ the correlation between $X_1$ and $X_2$. The inverse of (50) is

(51)  $\Sigma^{-1} = \dfrac{1}{1-\rho^2}\begin{pmatrix} \dfrac{1}{\sigma_1^2} & -\dfrac{\rho}{\sigma_1\sigma_2} \\ -\dfrac{\rho}{\sigma_1\sigma_2} & \dfrac{1}{\sigma_2^2} \end{pmatrix}.$

The density function of $X_1$ and $X_2$ is

(52)  $\dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\dfrac{1}{2(1-\rho^2)}\left[\dfrac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\dfrac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \dfrac{(x_2-\mu_2)^2}{\sigma_2^2}\right]\right\}.$

Theorem 2.3.2. The correlation coefficient $\rho$ of any bivariate distribution is invariant with respect to transformations $X_i^* = b_iX_i + c_i$, $b_i > 0$, $i = 1, 2$. Every function of the parameters of a bivariate normal distribution that is invariant with respect to such transformations is a function of $\rho$.

Proof. The variance of $X_i^*$ is $b_i^2\sigma_i^2$, $i = 1, 2$, and the covariance of $X_1^*$ and $X_2^*$ is $b_1b_2\sigma_1\sigma_2\rho$ by Lemma 2.3.2. Insertion of these values into the definition of the correlation between $X_1^*$ and $X_2^*$ shows that it is $\rho$. If $f(\mu_1,\mu_2,\sigma_1,\sigma_2,\rho)$ is invariant with respect to such transformations, it must be $f(0,0,1,1,\rho)$ by choice of $b_i = 1/\sigma_i$ and $c_i = -\mu_i/\sigma_i$, $i = 1, 2$. ∎

The correlation coefficient $\rho$ is the natural measure of association between $X_1$ and $X_2$. Any function of the parameters of the bivariate normal distribution that is independent of the scale and location parameters is a function of $\rho$. The standardized variable (or standard score) is $Y_i = (X_i - \mu_i)/\sigma_i$. The mean squared difference between the two standardized variables is

(53)  $\mathscr{E}(Y_1 - Y_2)^2 = 2(1 - \rho).$

The smaller (53) is (that is, the larger $\rho$ is), the more similar $Y_1$ and $Y_2$ are. If $\rho > 0$, $X_1$ and $X_2$ tend to be positively related, and if $\rho < 0$, they tend to be negatively related. If $\rho = 0$, the density (52) is the product of the marginal densities of $X_1$ and $X_2$; hence $X_1$ and $X_2$ are independent.
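
The factorization of (52) at $\rho = 0$ is easy to confirm numerically. The sketch below is a small illustration with made-up parameter values and hypothetical function names: it evaluates the bivariate density at one point for $\rho = 0$ and for $\rho = 0.6$ and compares each with the product of the two univariate normal densities.

```python
import math

def bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Density (52) of the bivariate normal distribution."""
    z1 = (x1 - mu1) / sigma1
    z2 = (x2 - mu2) / sigma2
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm

def univariate_normal_pdf(x, mu, sigma):
    """Density of a univariate normal variable with mean mu and variance sigma^2."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative parameter values and evaluation point (not from the text).
mu1, mu2, sigma1, sigma2 = 1.0, -0.5, 2.0, 0.7
x1, x2 = 0.3, 0.1

joint_rho0 = bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, 0.0)
product = univariate_normal_pdf(x1, mu1, sigma1) * univariate_normal_pdf(x2, mu2, sigma2)
joint_rho6 = bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, 0.6)
print(joint_rho0, product, joint_rho6)
```

The first two printed values should coincide, while the third differs, since for $\rho \ne 0$ the joint density is no longer the product of the marginal densities.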

It will be noticed that the density function (45) is constant on ellipsoids

(54)  $(x-\mu)'\Sigma^{-1}(x-\mu) = c$

for every positive value of $c$ in a p-dimensional Euclidean space. The center of each ellipsoid is at the point $\mu$. The shape and orientation of the ellipsoid are determined by $\Sigma$, and the size (given $\Sigma$) is determined by $c$. Because (54) is a sphere if $\Sigma = \sigma^2 I$, $n(x \mid \mu, \sigma^2 I)$ is known as a spherical normal density.

Let us consider in detail the bivariate case of the density (52). We transform coordinates by $(x_i - \mu_i)/\sigma_i = y_i$, $i = 1, 2$, so that the centers of the loci of constant density are at the origin. These loci are defined by

(55)  $\dfrac{1}{1-\rho^2}\left(y_1^2 - 2\rho y_1y_2 + y_2^2\right) = c.$

The intercepts on the $y_1$-axis and $y_2$-axis are equal. If $\rho > 0$, the major axis of the ellipse is along the 45° line with a length of $2\sqrt{c(1+\rho)}$, and the minor axis has a length of $2\sqrt{c(1-\rho)}$. If $\rho < 0$, the major axis is along the 135° line with a length of $2\sqrt{c(1-\rho)}$, and the minor axis has a length of $2\sqrt{c(1+\rho)}$. The value of $\rho$ determines the ratio of these lengths. In this bivariate case we can think of the density function as a surface above the plane. The contours of equal density are contours of equal altitude on a topographical map; they indicate the shape of the hill (or probability surface). If $\rho > 0$, the hill will tend to run along a line with a positive slope; most of the hill will be in the first and third quadrants. When we transform back to $x_i = \sigma_iy_i + \mu_i$, we expand each contour by a factor of $\sigma_i$ in the direction of the ith axis and shift the center to $(\mu_1, \mu_2)$.
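
The axis lengths quoted for the contour (55) can be checked by placing the end points of the two semi-axes on the 45° and 135° lines and substituting them back into (55). The values of `c` and `rho` below are arbitrary illustrations, and the helper names are hypothetical.

```python
import math

def contour_value(y1, y2, rho):
    """Left-hand side of (55) at the point (y1, y2)."""
    return (y1 * y1 - 2.0 * rho * y1 * y2 + y2 * y2) / (1.0 - rho * rho)

def axis_lengths(c, rho):
    """Full lengths of the axes along the 45-degree and 135-degree lines."""
    return 2.0 * math.sqrt(c * (1.0 + rho)), 2.0 * math.sqrt(c * (1.0 - rho))

c, rho = 2.0, 0.5                      # illustrative values
len_45, len_135 = axis_lengths(c, rho)

# A point at distance d from the origin on the 45-degree line is (t, t) with t = d / sqrt(2).
t45 = (len_45 / 2.0) / math.sqrt(2.0)
t135 = (len_135 / 2.0) / math.sqrt(2.0)
on_45 = contour_value(t45, t45, rho)       # end point of the semi-axis on the 45-degree line
on_135 = contour_value(-t135, t135, rho)   # end point of the semi-axis on the 135-degree line
print(len_45, len_135, on_45, on_135)
```

Both substituted values should equal `c`, and for positive `rho` the axis along the 45° line is the longer one, in agreement with the description above.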
The numerical values of the cdf of the univariate normal variable are obtained from tables found in most statistical texts. The numerical values of

(56)
$$\begin{aligned}
F(x_1, x_2) &= \Pr\{X_1 \le x_1,\ X_2 \le x_2\} \\
&= \Pr\left\{\frac{X_1-\mu_1}{\sigma_1} \le y_1,\ \frac{X_2-\mu_2}{\sigma_2} \le y_2\right\},
\end{aligned}$$

where $y_1 = (x_1 - \mu_1)/\sigma_1$ and $y_2 = (x_2 - \mu_2)/\sigma_2$, can be found in Pearson (1931). An extensive table has been given by the National Bureau of Standards (1959). A bibliography of such tables has been given by Gupta (1963). Pearson has also shown that

(57)  $F(x_1, x_2) = \sum_{j=0}^{\infty} \rho^j\,\tau_j(y_1)\,\tau_j(y_2),$

where the so-called tetrachoric functions $\tau_j(y)$ are tabulated in Pearson (1930) up to $\tau_{19}(y)$. Harris and Soms (1980) have studied generalizations of (57).


2.4. THE DISTRIBUTION OF LINEAR COMBINATIONS OF 
NORMALLY DISTRIBUTED VARIATES; INDEPENDENCE 
OF VARIATES; MARGINAL DISTRIBUTIONS 


One of the reasons that the study of normal multivariate distributions is so 
useful is that marginal distributions and conditional distributions derived 
from multivariate normal distributions are also normal distributions. More- 
over, linear combinations of multivariate normal variates are again normally 
distributed. First we shall show that if we make a nonsingular linear transformation of a vector whose components have a joint distribution with a normal density, we obtain a vector whose components are jointly distributed with a normal density.


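Theorem 2.4.1. Let $X$ (with p components) be distributed according to $N(\mu, \Sigma)$. Then

(1)  $Y = CX$

is distributed according to $N(C\mu, C\Sigma C')$ for $C$ nonsingular.

Proof. The density of $Y$ is obtained from the density of $X$, $n(x \mid \mu, \Sigma)$, by replacing $x$ by

(2)  $x = C^{-1}y,$

and multiplying by the Jacobian of the transformation (2),

(3)  $\operatorname{mod}|C^{-1}| = \dfrac{1}{\operatorname{mod}|C|} = \sqrt{\dfrac{|\Sigma|}{|C|\cdot|\Sigma|\cdot|C'|}} = \dfrac{|\Sigma|^{\frac12}}{|C\Sigma C'|^{\frac12}}.$

The quadratic form in the exponent of $n(x \mid \mu, \Sigma)$ is

(4)  $Q = (x-\mu)'\Sigma^{-1}(x-\mu).$

The transformation (2) carries $Q$ into

(5)
$$\begin{aligned}
Q &= (C^{-1}y - \mu)'\Sigma^{-1}(C^{-1}y - \mu) \\
&= (C^{-1}y - C^{-1}C\mu)'\Sigma^{-1}(C^{-1}y - C^{-1}C\mu) \\
&= [C^{-1}(y - C\mu)]'\Sigma^{-1}[C^{-1}(y - C\mu)] \\
&= (y - C\mu)'(C^{-1})'\Sigma^{-1}C^{-1}(y - C\mu) \\
&= (y - C\mu)'(C\Sigma C')^{-1}(y - C\mu)
\end{aligned}$$

since $(C^{-1})' = (C')^{-1}$ by virtue of transposition of $CC^{-1} = I$. Thus the density of $Y$ is

(6)
$$\begin{aligned}
n(C^{-1}y \mid \mu, \Sigma)\,\operatorname{mod}|C|^{-1} &= (2\pi)^{-\frac12 p}\,|C\Sigma C'|^{-\frac12} \exp\left[-\tfrac12 (y - C\mu)'(C\Sigma C')^{-1}(y - C\mu)\right] \\
&= n(y \mid C\mu, C\Sigma C'). \qquad ∎
\end{aligned}$$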
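
The identity (6) can be spot-checked at a single point for p = 2. In the sketch below the mean vector, covariance matrix, transformation `C`, and evaluation point `y` are arbitrary choices made for illustration; the left-hand side is the density of $X$ at $C^{-1}y$ multiplied by the Jacobian, and the right-hand side is the normal density with parameters $C\mu$ and $C\Sigma C'$.

```python
import math

def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a nonsingular 2 x 2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def matmul2(P, Q):
    """Product of two 2 x 2 matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose2(P):
    """Transpose of a 2 x 2 matrix."""
    return [[P[0][0], P[1][0]], [P[0][1], P[1][1]]]

def matvec2(M, v):
    """Product of a 2 x 2 matrix and a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def normal_pdf2(x, mu, Sigma):
    """n(x | mu, Sigma) for p = 2."""
    Sinv = inv2(Sigma)
    d = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(Sigma)))

# Illustrative parameters; C is nonsingular (its determinant is 3).
mu = [1.0, 2.0]
Sigma = [[2.0, 0.5], [0.5, 1.0]]
C = [[1.0, 2.0], [0.0, 3.0]]
y = [4.0, 7.5]

# Left-hand side of (6): density of X at x = C^{-1} y times the Jacobian mod|C^{-1}|.
lhs = normal_pdf2(matvec2(inv2(C), y), mu, Sigma) / abs(det2(C))
# Right-hand side of (6): normal density with mean C mu and covariance C Sigma C'.
rhs = normal_pdf2(y, matvec2(C, mu), matmul2(matmul2(C, Sigma), transpose2(C)))
print(lhs, rhs)
```

Agreement at one point is of course not a proof, but it is a quick guard against sign or transpose errors when the theorem is applied in computation.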


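Now let us consider two sets of random variables $X_1,\dots,X_q$ and $X_{q+1},\dots,X_p$ forming the vectors

(7)  $X^{(1)} = \begin{pmatrix} X_1 \\ \vdots \\ X_q \end{pmatrix}, \qquad X^{(2)} = \begin{pmatrix} X_{q+1} \\ \vdots \\ X_p \end{pmatrix}.$

These variables form the random vector

(8)  $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.$

Now let us assume that the p variates have a joint normal distribution with mean vectors

(9)  $\mathscr{E}X^{(1)} = \mu^{(1)}, \qquad \mathscr{E}X^{(2)} = \mu^{(2)},$

and covariance matrices

(10)  $\mathscr{E}(X^{(1)} - \mu^{(1)})(X^{(1)} - \mu^{(1)})' = \Sigma_{11},$

(11)  $\mathscr{E}(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})' = \Sigma_{22},$

(12)  $\mathscr{E}(X^{(1)} - \mu^{(1)})(X^{(2)} - \mu^{(2)})' = \Sigma_{12}.$

We say that the random vector $X$ has been partitioned in (8) into subvectors, that

(13)  $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$

has been partitioned similarly into subvectors, and that

(14)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$

has been partitioned similarly into submatrices. Here $\Sigma_{21} = \Sigma_{12}'$. (See Appendix, Section A.3.)

We shall show that $X^{(1)}$ and $X^{(2)}$ are independently normally distributed if $\Sigma_{12} = \Sigma_{21}' = 0$. Then

(15)  $\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.$

Its inverse is

(16)  $\Sigma^{-1} = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}.$

Thus the quadratic form in the exponent of $n(x \mid \mu, \Sigma)$ is

(17)
$$\begin{aligned}
Q &= (x-\mu)'\Sigma^{-1}(x-\mu) \\
&= \left[(x^{(1)} - \mu^{(1)})',\ (x^{(2)} - \mu^{(2)})'\right] \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} x^{(1)} - \mu^{(1)} \\ x^{(2)} - \mu^{(2)} \end{pmatrix} \\
&= \left[(x^{(1)} - \mu^{(1)})'\Sigma_{11}^{-1},\ (x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}\right] \begin{pmatrix} x^{(1)} - \mu^{(1)} \\ x^{(2)} - \mu^{(2)} \end{pmatrix} \\
&= (x^{(1)} - \mu^{(1)})'\Sigma_{11}^{-1}(x^{(1)} - \mu^{(1)}) + (x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}) \\
&= Q_1 + Q_2,
\end{aligned}$$

say, where

(18)  $Q_1 = (x^{(1)} - \mu^{(1)})'\Sigma_{11}^{-1}(x^{(1)} - \mu^{(1)}), \qquad Q_2 = (x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}).$

Also we note that $|\Sigma| = |\Sigma_{11}|\cdot|\Sigma_{22}|$. The density of $X$ can be written

(19)
$$\begin{aligned}
n(x \mid \mu, \Sigma) &= \frac{1}{(2\pi)^{\frac12 p}|\Sigma|^{\frac12}}\, e^{-\frac12 Q} \\
&= \frac{1}{(2\pi)^{\frac12 q}|\Sigma_{11}|^{\frac12}}\, e^{-\frac12 Q_1} \cdot \frac{1}{(2\pi)^{\frac12 (p-q)}|\Sigma_{22}|^{\frac12}}\, e^{-\frac12 Q_2} \\
&= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11})\, n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22}).
\end{aligned}$$

The marginal density of $X^{(1)}$ is given by the integral

(20)
$$\begin{aligned}
&\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11})\, n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})\,dx_{q+1}\cdots dx_p \\
&\qquad = n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11}) \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})\,dx_{q+1}\cdots dx_p \\
&\qquad = n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11}).
\end{aligned}$$

Thus the marginal distribution of $X^{(1)}$ is $N(\mu^{(1)}, \Sigma_{11})$; similarly the marginal distribution of $X^{(2)}$ is $N(\mu^{(2)}, \Sigma_{22})$. Thus the joint density of $X_1,\dots,X_p$ is the product of the marginal density of $X_1,\dots,X_q$ and the marginal density of $X_{q+1},\dots,X_p$, and therefore the two sets of variates are independent. Since the numbering of variates can always be done so that $X^{(1)}$ consists of any subset of the variates, we have proved the sufficiency in the following theorem: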


Theorem 2.4.2. If $X_1,\dots,X_p$ have a joint normal distribution, a necessary and sufficient condition for one subset of the random variables and the subset consisting of the remaining variables to be independent is that each covariance of a variable from one set and a variable from the other set is 0.

The necessity follows from the fact that if $X_i$ is from one set and $X_j$ from the other, then for any density (see Section 2.2.3)

(21)
$$\begin{aligned}
\sigma_{ij} &= \mathscr{E}(X_i - \mu_i)(X_j - \mu_j) \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (x_i - \mu_i)(x_j - \mu_j)\, f(x_1,\dots,x_q)\, f(x_{q+1},\dots,x_p)\,dx_1\cdots dx_p \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (x_i - \mu_i)\, f(x_1,\dots,x_q)\,dx_1\cdots dx_q \cdot \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} (x_j - \mu_j)\, f(x_{q+1},\dots,x_p)\,dx_{q+1}\cdots dx_p \\
&= 0.
\end{aligned}$$

Since $\sigma_{ij} = \sigma_i\sigma_j\rho_{ij}$ and $\sigma_i, \sigma_j \ne 0$ (we tacitly assume that $\Sigma$ is nonsingular), the condition $\sigma_{ij} = 0$ is equivalent to $\rho_{ij} = 0$. Thus if one set of variates is uncorrelated with the remaining variates, the two sets are independent. It should be emphasized that the implication of independence by lack of correlation depends on the assumption of normality, but the converse is always true.

Let us consider the special case of the bivariate normal distribution. Then $X^{(1)} = X_1$, $X^{(2)} = X_2$, $\mu^{(1)} = \mu_1$, $\mu^{(2)} = \mu_2$, $\Sigma_{11} = \sigma_{11} = \sigma_1^2$, $\Sigma_{22} = \sigma_{22} = \sigma_2^2$, and $\Sigma_{12} = \Sigma_{21} = \sigma_{12} = \sigma_1\sigma_2\rho_{12}$. Thus if $X_1$ and $X_2$ have a bivariate normal distribution, they are independent if and only if they are uncorrelated. If they are uncorrelated, the marginal distribution of $X_i$ is normal with mean $\mu_i$ and variance $\sigma_i^2$. The above discussion also proves the following corollary:


Corollary 2.4.1. If $X$ is distributed according to $N(\mu, \Sigma)$ and if a set of components of $X$ is uncorrelated with the other components, the marginal distribution of the set is multivariate normal with means, variances, and covariances obtained by taking the corresponding components of $\mu$ and $\Sigma$, respectively.


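Now let us show that the corollary holds even if the two sets are not independent. We partition $X$, $\mu$, and $\Sigma$ as before. We shall make a nonsingular linear transformation to subvectors

(22)  $Y^{(1)} = X^{(1)} + BX^{(2)},$

(23)  $Y^{(2)} = X^{(2)},$

choosing $B$ so that the components of $Y^{(1)}$ are uncorrelated with the components of $Y^{(2)} = X^{(2)}$. The matrix $B$ must satisfy the equation

(24)
$$\begin{aligned}
0 &= \mathscr{E}(Y^{(1)} - \mathscr{E}Y^{(1)})(Y^{(2)} - \mathscr{E}Y^{(2)})' \\
&= \mathscr{E}(X^{(1)} + BX^{(2)} - \mathscr{E}X^{(1)} - B\,\mathscr{E}X^{(2)})(X^{(2)} - \mathscr{E}X^{(2)})' \\
&= \mathscr{E}[(X^{(1)} - \mathscr{E}X^{(1)}) + B(X^{(2)} - \mathscr{E}X^{(2)})](X^{(2)} - \mathscr{E}X^{(2)})' \\
&= \Sigma_{12} + B\Sigma_{22}.
\end{aligned}$$

Thus $B = -\Sigma_{12}\Sigma_{22}^{-1}$ and

(25)  $Y^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)}.$

The vector

(26)  $\begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} = Y = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} X$

is a nonsingular transform of $X$, and therefore has a normal distribution with

(27)
$$\begin{aligned}
\mathscr{E}\begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} &= \mathscr{E}\begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} X = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix} \\
&= \begin{pmatrix} \mu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\mu^{(2)} \\ \mu^{(2)} \end{pmatrix} = \begin{pmatrix} \nu^{(1)} \\ \nu^{(2)} \end{pmatrix} = \nu,
\end{aligned}$$

say, and

(28)
$$\begin{aligned}
\mathscr{C}(Y) &= \mathscr{E}(Y - \nu)(Y - \nu)' \\
&= \begin{pmatrix} \mathscr{E}(Y^{(1)} - \nu^{(1)})(Y^{(1)} - \nu^{(1)})' & \mathscr{E}(Y^{(1)} - \nu^{(1)})(Y^{(2)} - \nu^{(2)})' \\ \mathscr{E}(Y^{(2)} - \nu^{(2)})(Y^{(1)} - \nu^{(1)})' & \mathscr{E}(Y^{(2)} - \nu^{(2)})(Y^{(2)} - \nu^{(2)})' \end{pmatrix} \\
&= \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}
\end{aligned}$$

since

(29)
$$\begin{aligned}
\mathscr{E}(Y^{(1)} - \nu^{(1)})(Y^{(1)} - \nu^{(1)})' &= \mathscr{E}[(X^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})] \cdot [(X^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})]' \\
&= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21} \\
&= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.
\end{aligned}$$

Thus $Y^{(1)}$ and $Y^{(2)}$ are independent, and by Corollary 2.4.1 $X^{(2)} = Y^{(2)}$ has the marginal distribution $N(\mu^{(2)}, \Sigma_{22})$. Because the numbering of the components of $X$ is arbitrary, we can state the following theorem:

Theorem 2.4.3. If $X$ is distributed according to $N(\mu, \Sigma)$, the marginal distribution of any set of components of $X$ is multivariate normal with means, variances, and covariances obtained by taking the corresponding components of $\mu$ and $\Sigma$, respectively.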
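
The decorrelating transformation (25) and the block-diagonal covariance matrix (28) can be illustrated with p = 3 and q = 2, so that $\Sigma_{22}$ is a single number and no matrix inversion is needed. The covariance matrix below is an assumed example, and the helper names are hypothetical.

```python
def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def transpose(P):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*P)]

# Assumed positive definite covariance matrix; X^(1) = (X_1, X_2)', X^(2) = X_3.
Sigma = [[4.0, 1.0, 2.0],
         [1.0, 3.0, 0.5],
         [2.0, 0.5, 2.0]]
s22 = Sigma[2][2]                                   # Sigma_22 is 1 x 1 here

# B = -Sigma_12 Sigma_22^{-1}, a 2 x 1 matrix stored as a list of two numbers.
B = [-Sigma[0][2] / s22, -Sigma[1][2] / s22]

# Matrix of the transformation (26).
T = [[1.0, 0.0, B[0]],
     [0.0, 1.0, B[1]],
     [0.0, 0.0, 1.0]]

cov_y = matmul(matmul(T, Sigma), transpose(T))      # covariance matrix of Y, as in (28)

# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 computed entry by entry.
schur = [[Sigma[i][j] - Sigma[i][2] * Sigma[2][j] / s22 for j in range(2)] for i in range(2)]
print(cov_y)
print(schur)
```

The off-diagonal blocks of `cov_y` should vanish, its upper-left block should equal `schur`, and its lower-right entry should remain $\Sigma_{22}$.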


Now consider any transformation

(30)  $Z = DX,$

where $Z$ has q components and $D$ is a $q \times p$ real matrix. The expected value of $Z$ is

(31)  $\mathscr{E}Z = D\mu,$

and the covariance matrix is

(32)  $\mathscr{E}(Z - D\mu)(Z - D\mu)' = D\Sigma D'.$

The case $q = p$ and $D$ nonsingular has been treated above. If $q \le p$ and $D$ is of rank q, we can find a $(p-q) \times p$ matrix $E$ such that

(33)  $\begin{pmatrix} Z \\ W \end{pmatrix} = \begin{pmatrix} D \\ E \end{pmatrix} X$

is a nonsingular transformation. (See Appendix, Section A.3.) Then $Z$ and $W$ have a joint normal distribution, and $Z$ has a marginal normal distribution by Theorem 2.4.3. Thus for $D$ of rank q (and $X$ having a nonsingular distribution, that is, a density) we have proved the following theorem:

Theorem 2.4.4. If $X$ is distributed according to $N(\mu, \Sigma)$, then $Z = DX$ is distributed according to $N(D\mu, D\Sigma D')$, where $D$ is a $q \times p$ matrix of rank $q \le p$.


The remainder of this section is devoted to the singular or degenerate normal distribution and the extension of Theorem 2.4.4 to the case of any matrix $D$. A singular distribution is a distribution in p-space that is concentrated on a lower dimensional set; that is, the probability associated with any set not intersecting the given set is 0. In the case of the singular normal distribution the mass is concentrated on a given linear set [that is, the intersection of a number of (p − 1)-dimensional hyperplanes]. Let $y$ be a set of coordinates in the linear set (the number of coordinates equaling the dimensionality of the linear set); then the parametric definition of the linear set can be given as $x = Ay + \lambda$, where $A$ is a $p \times q$ matrix and $\lambda$ is a p-vector. Suppose that $Y$ is normally distributed in the q-dimensional linear set; then we say that

(34)  $X = AY + \lambda$

has a singular or degenerate normal distribution in p-space. If $\mathscr{E}Y = \nu$, then $\mathscr{E}X = A\nu + \lambda = \mu$, say. If $\mathscr{E}(Y - \nu)(Y - \nu)' = T$, then

(35)  $\mathscr{E}(X - \mu)(X - \mu)' = \mathscr{E}A(Y - \nu)(Y - \nu)'A' = ATA' = \Sigma,$

say. It should be noticed that if $p > q$, then $\Sigma$ is singular and therefore has no inverse, and thus we cannot write the normal density for $X$. In fact, $X$ cannot have a density at all, because the fact that the probability of any set not intersecting the q-set is 0 would imply that the density is 0 almost everywhere.

Now, conversely, let us see that if $X$ has mean $\mu$ and covariance matrix $\Sigma$ of rank r, it can be written as (34) (except for 0 probabilities), where $X$ has an arbitrary distribution, and $Y$ of r ($\le p$) components has a suitable distribution. If $\Sigma$ is of rank r, there is a $p \times p$ nonsingular matrix $B$ such that

(36)  $B\Sigma B' = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix},$

where the identity is of order r. (See Theorem A.4.1 of the Appendix.) The transformation

(37)  $BX = V = \begin{pmatrix} V^{(1)} \\ V^{(2)} \end{pmatrix}$

defines a random vector $V$ with covariance matrix (36) and a mean vector

(38)  $\mathscr{E}V = B\mu = \nu = \begin{pmatrix} \nu^{(1)} \\ \nu^{(2)} \end{pmatrix},$

say. Since the variances of the elements of $V^{(2)}$ are zero, $V^{(2)} = \nu^{(2)}$ with probability 1. Now partition

(39)  $B^{-1} = (C \quad D),$

where $C$ consists of r columns. Then (37) is equivalent to

(40)  $X = B^{-1}V = (C \quad D)\begin{pmatrix} V^{(1)} \\ V^{(2)} \end{pmatrix} = CV^{(1)} + DV^{(2)}.$

Thus with probability 1

(41)  $X = CV^{(1)} + D\nu^{(2)},$

which is of the form of (34) with $C$ as $A$, $V^{(1)}$ as $Y$, and $D\nu^{(2)}$ as $\lambda$.
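
A minimal instance of the representation (34) may help fix ideas. In the sketch below, which uses made-up numbers, p = 2 and q = 1: the covariance matrix $\Sigma = ATA'$ of (35) has determinant zero, and every realization of $X$ lies on one line in the plane.

```python
def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def transpose(P):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*P)]

A = [[1.0], [2.0]]        # p x q matrix with p = 2, q = 1 (assumed values)
lam = [0.5, -1.0]         # the p-vector lambda
T = [[3.0]]               # covariance matrix (here the variance) of Y

Sigma = matmul(matmul(A, T), transpose(A))          # Sigma = A T A', as in (35)
det_sigma = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]

def realize(y):
    """The point x = A y + lambda corresponding to a value y of Y."""
    return [A[i][0] * y + lam[i] for i in range(2)]

# Every realization satisfies x_2 - 2 x_1 = lambda_2 - 2 lambda_1, the linear set in (34).
residuals = [realize(y)[1] - 2.0 * realize(y)[0] - (lam[1] - 2.0 * lam[0])
             for y in (-1.0, 0.0, 2.5)]
print(Sigma, det_sigma, residuals)
```

Because the determinant vanishes, $\Sigma^{-1}$ does not exist and the density (45) of Section 2.3 cannot be written for this $X$, which is the situation the definition that follows is designed to cover.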


Now we give a formal definition of a normal distribution that includes the singular distribution.

Definition 2.4.1. A random vector $X$ of p components with $\mathscr{E}X = \mu$ and $\mathscr{E}(X - \mu)(X - \mu)' = \Sigma$ is said to be normally distributed [or is said to be distributed according to $N(\mu, \Sigma)$] if there is a transformation (34), where the number of rows of $A$ is p and the number of columns is the rank of $\Sigma$, say r, and $Y$ (of r components) has a nonsingular normal distribution, that is, has a density

(42)  $k\, e^{-\frac12 (y - \nu)'T^{-1}(y - \nu)}.$

It is clear that if $\Sigma$ has rank p, then $A$ can be taken to be $I$ and $\lambda$ to be 0; then $X = Y$ and Definition 2.4.1 agrees with Section 2.3. To avoid redundancy in Definition 2.4.1 we could take $T = I$ and $\nu = 0$.

Theorem 2.4.5. If $X$ is distributed according to $N(\mu, \Sigma)$, then $Z = DX$ is distributed according to $N(D\mu, D\Sigma D')$.

This theorem includes the cases where $X$ may have a nonsingular or a singular distribution and $D$ may be nonsingular or of rank less than q. Since $X$ can be represented by (34) where $Y$ has a nonsingular distribution $N(\nu, T)$, we can write

(43)  $Z = DAY + D\lambda,$

where $DA$ is $q \times r$. If the rank of $DA$ is r, the theorem is proved. If the rank is less than r, say s, then the covariance matrix of $Z$,

(44)  $DATA'D' = E,$

say, is of rank s. By Theorem A.4.1 of the Appendix, there is a nonsingular matrix

(45)  $F = \begin{pmatrix} F_1 \\ F_2 \end{pmatrix}$

such that

(46)
$$\begin{aligned}
FEF' &= \begin{pmatrix} F_1EF_1' & F_1EF_2' \\ F_2EF_1' & F_2EF_2' \end{pmatrix} \\
&= \begin{pmatrix} (F_1DA)T(F_1DA)' & (F_1DA)T(F_2DA)' \\ (F_2DA)T(F_1DA)' & (F_2DA)T(F_2DA)' \end{pmatrix} = \begin{pmatrix} I_s & 0 \\ 0 & 0 \end{pmatrix}.
\end{aligned}$$

Thus $F_1DA$ is of rank s (by the converse of Theorem A.1.1 of the Appendix), and $F_2DA = 0$ because each diagonal element of $(F_2DA)T(F_2DA)'$ is a quadratic form in a row of $F_2DA$ with positive definite matrix $T$. Thus the covariance matrix of $FZ$ is (46), and

(47)  $FZ = \begin{pmatrix} F_1 \\ F_2 \end{pmatrix} DAY + FD\lambda = \begin{pmatrix} F_1DAY \\ 0 \end{pmatrix} + FD\lambda = \begin{pmatrix} U_1 \\ 0 \end{pmatrix} + FD\lambda,$

say. Clearly $U_1$ has a nonsingular normal distribution. Let $F^{-1} = (G_1 \quad G_2)$. Then

(48)  $Z = G_1U_1 + D\lambda,$

which is of the form (34). ∎


The developments in this section can be illuminated by considering the geometric interpretation put forward in the previous section. The density of $X$ is constant on the ellipsoids (54) of Section 2.3. Since the transformation (2) is a linear transformation (i.e., a change of coordinate axes), the density of $Y$ is constant on ellipsoids

(49)  $(y - C\mu)'(C\Sigma C')^{-1}(y - C\mu) = k.$

The marginal distribution of $X^{(1)}$ is the projection of the mass of the distribution of $X$ onto the q-dimensional space of the first q coordinate axes. The surfaces of constant density are again ellipsoids. The projection of mass on any line is normal.


2.5. CONDITIONAL DISTRIBUTIONS AND MULTIPLE 
CORRELATION COEFFICIENT 


2.5.1. Conditional Distributions 


In this section we find that conditional distributions derived from joint normal distributions are normal. The conditional distributions are of a particularly simple nature because the means depend only linearly on the variates held fixed, and the variances and covariances do not depend at all on the values of the fixed variates. The theory of partial and multiple correlation discussed in this section was originally developed by Karl Pearson (1896) for three variables and extended by Yule (1897a, 1897b).

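Let $X$ be distributed according to $N(\mu, \Sigma)$ (with $\Sigma$ nonsingular). Let us partition

(1)  $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$

as before into q- and (p − q)-component subvectors, respectively. We shall use the algebra developed in Section 2.4 here. The joint density of $Y^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)}$ and $Y^{(2)} = X^{(2)}$ is

$$n(y^{(1)} \mid \mu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\mu^{(2)},\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})\; n(y^{(2)} \mid \mu^{(2)}, \Sigma_{22}).$$

The density of $X^{(1)}$ and $X^{(2)}$ then can be obtained from this expression by substituting $x^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}x^{(2)}$ for $y^{(1)}$ and $x^{(2)}$ for $y^{(2)}$ (the Jacobian of this transformation being 1); the resulting density of $X^{(1)}$ and $X^{(2)}$ is

(2)
$$\begin{aligned}
f(x^{(1)}, x^{(2)}) = {} & \frac{1}{(2\pi)^{\frac12 q}\sqrt{|\Sigma_{11\cdot 2}|}} \exp\Big\{-\tfrac12 \big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big]' \\
& \qquad \cdot\ \Sigma_{11\cdot 2}^{-1} \big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big]\Big\} \\
& \cdot \frac{1}{(2\pi)^{\frac12 (p-q)}\sqrt{|\Sigma_{22}|}} \exp\big[-\tfrac12 (x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big],
\end{aligned}$$

where

(3)  $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

This density must be $n(x \mid \mu, \Sigma)$. The conditional density of $X^{(1)}$ given that $X^{(2)} = x^{(2)}$ is the quotient of (2) and the marginal density of $X^{(2)}$ at the point $x^{(2)}$, which is $n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})$, the second factor of (2). The quotient is

(4)
$$\begin{aligned}
f(x^{(1)} \mid x^{(2)}) = {} & \frac{1}{(2\pi)^{\frac12 q}\sqrt{|\Sigma_{11\cdot 2}|}} \exp\Big\{-\tfrac12 \big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big]' \\
& \qquad \cdot\ \Sigma_{11\cdot 2}^{-1} \big[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\big]\Big\}.
\end{aligned}$$

It is understood that $x^{(2)}$ consists of p − q numbers. The density $f(x^{(1)} \mid x^{(2)})$ is a q-variate normal density with mean

(5)  $\mathscr{E}(X^{(1)} \mid x^{(2)}) = \mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}) = \nu(x^{(2)}),$

say, and covariance matrix

(6)  $\mathscr{E}\{[X^{(1)} - \nu(x^{(2)})][X^{(1)} - \nu(x^{(2)})]' \mid x^{(2)}\} = \Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

It should be noted that the mean of $X^{(1)}$ given $x^{(2)}$ is simply a linear function of $x^{(2)}$, and the covariance matrix of $X^{(1)}$ given $x^{(2)}$ does not depend on $x^{(2)}$ at all.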


Definition 2.5.1. The matrix $B = \Sigma_{12}\Sigma_{22}^{-1}$ is the matrix of regression coefficients of $X^{(1)}$ on $x^{(2)}$.

The element in the ith row and (k − q)th column of $B = \Sigma_{12}\Sigma_{22}^{-1}$ is often denoted by

(7)  $\beta_{ik\cdot q+1,\dots,k-1,k+1,\dots,p}, \qquad i = 1,\dots,q, \quad k = q+1,\dots,p.$

The vector $\mu^{(1)} + B(x^{(2)} - \mu^{(2)})$ is called the regression function.

Let $\sigma_{ij\cdot q+1,\dots,p}$ be the i, jth element of $\Sigma_{11\cdot 2}$. We call these partial covariances; $\sigma_{ii\cdot q+1,\dots,p}$ is a partial variance.

Definition 2.5.2.

(8)  $\rho_{ij\cdot q+1,\dots,p} = \dfrac{\sigma_{ij\cdot q+1,\dots,p}}{\sqrt{\sigma_{ii\cdot q+1,\dots,p}}\,\sqrt{\sigma_{jj\cdot q+1,\dots,p}}}, \qquad i,j = 1,\dots,q,$

is the partial correlation between $X_i$ and $X_j$ holding $X_{q+1},\dots,X_p$ fixed.

The numbering of the components of $X$ is arbitrary and q is arbitrary. Hence, the above serves to define the conditional distribution of any q components of $X$ given any other p − q components. In the case of partial covariances and correlations the conditioning variables are indicated by the subscripts after the dot, and in the case of regression coefficients the dependent variable is indicated by the first subscript, the relevant conditioning variable by the second subscript, and the other conditioning variables by the subscripts after the dot. Further, the notation accommodates the conditional distribution of any q variables conditional on any other r − q variables ($q \le r \le p$).
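
For p = 3 and q = 2 the partial correlation (8) reduces to a short computation, since $\Sigma_{22}$ is the single variance $\sigma_{33}$. The sketch below uses an assumed covariance matrix and hypothetical function names; it also compares the result with the standard three-variable expression $(\rho_{12} - \rho_{13}\rho_{23})/\sqrt{(1-\rho_{13}^2)(1-\rho_{23}^2)}$, which is a well-known identity but is not stated in this passage.

```python
import math

# Assumed positive definite covariance matrix of (X_1, X_2, X_3).
Sigma = [[4.0, 1.0, 2.0],
         [1.0, 3.0, 0.5],
         [2.0, 0.5, 2.0]]

def partial_cov_given_last(S):
    """Sigma_{11.2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 when X^(2) = X_3."""
    s33 = S[2][2]
    return [[S[i][j] - S[i][2] * S[2][j] / s33 for j in range(2)] for i in range(2)]

def corr(S, i, j):
    """Ordinary correlation rho_ij, as in (46) of Section 2.3."""
    return S[i][j] / math.sqrt(S[i][i] * S[j][j])

# Partial correlation (8) between X_1 and X_2 holding X_3 fixed.
S112 = partial_cov_given_last(Sigma)
rho_12_3 = S112[0][1] / math.sqrt(S112[0][0] * S112[1][1])

# Cross-check with the three-variable identity expressed in ordinary correlations.
r12, r13, r23 = corr(Sigma, 0, 1), corr(Sigma, 0, 2), corr(Sigma, 1, 2)
rho_check = (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
print(rho_12_3, rho_check)
```

In this example the partial correlation is smaller than the ordinary correlation $\rho_{12}$, because part of the association between $X_1$ and $X_2$ is carried by their common dependence on $X_3$.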


Theorem 2.5.1. Let the components of $X$ be divided into two groups composing the subvectors $X^{(1)}$ and $X^{(2)}$. Suppose the mean $\mu$ is similarly divided into $\mu^{(1)}$ and $\mu^{(2)}$, and suppose the covariance matrix $\Sigma$ of $X$ is divided into $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{22}$, the covariance matrices of $X^{(1)}$, of $X^{(1)}$ and $X^{(2)}$, and of $X^{(2)}$, respectively. Then if the distribution of $X$ is normal, the conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is normal with mean $\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})$ and covariance matrix $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.


As an example of the above considerations let us consider the bivariate
normal distribution and find the conditional distribution of $X_1$ given
$X_2 = x_2$. In this case $\mu^{(1)} = \mu_1$, $\mu^{(2)} = \mu_2$,
$\Sigma_{11} = \sigma_1^2$, $\Sigma_{12} = \sigma_1\sigma_2\rho$, and
$\Sigma_{22} = \sigma_2^2$. Thus the $1 \times 1$ matrix of regression
coefficients is $\Sigma_{12}\Sigma_{22}^{-1} = \sigma_1\rho/\sigma_2$, and the
$1 \times 1$ matrix of partial covariances is

(9)  $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \sigma_1^2 - \sigma_1^2\sigma_2^2\rho^2/\sigma_2^2 = \sigma_1^2(1 - \rho^2)$.

The density of $X_1$ given $x_2$ is
$n[x_1 \mid \mu_1 + (\sigma_1\rho/\sigma_2)(x_2 - \mu_2),\ \sigma_1^2(1 - \rho^2)]$.
The mean of this conditional distribution increases with $x_2$ when $\rho$ is
positive and decreases with increasing $x_2$ when $\rho$ is negative. It may be
noted that when $\sigma_1 = \sigma_2$, for example, the mean of the conditional
distribution of $X_1$ does not increase relative to $\mu_1$ as much as $x_2$
increases relative to $\mu_2$. [Galton (1889) observed that the average heights
of sons whose fathers' heights were above average tended to be less than the
fathers' heights; he called this effect "regression towards mediocrity."] The
larger $|\rho|$ is, the smaller the variance of the conditional distribution,
that is, the more information $x_2$ gives about $x_1$. This is another reason
for considering $\rho$ a measure of association between $X_1$ and $X_2$.
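The formulas of Theorem 2.5.1 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `conditional_normal` and the numerical values are illustrative choices, not part of the text. It partitions $\mu$ and $\Sigma$ after the first $q$ components and reproduces the bivariate result (9).

```python
import numpy as np

def conditional_normal(mu, sigma, q, x2):
    """Mean and covariance of X^(1) given X^(2) = x2 (Theorem 2.5.1).

    X^(1) is taken to be the first q components of X (an assumption of
    this sketch); mu and sigma are the mean vector and covariance matrix.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    s11, s12 = sigma[:q, :q], sigma[:q, q:]
    s21, s22 = sigma[q:, :q], sigma[q:, q:]
    # B = Sigma_12 Sigma_22^{-1}, obtained from a linear solve
    B = np.linalg.solve(s22.T, s12.T).T
    cond_mean = mu[:q] + B @ (x2 - mu[q:])
    cond_cov = s11 - B @ s21
    return cond_mean, cond_cov

# Bivariate example: sigma_1 = 2, sigma_2 = 1, rho = 0.5 (illustrative values)
sigma1, sigma2, rho = 2.0, 1.0, 0.5
mu = [1.0, 3.0]
sigma = [[sigma1**2, sigma1 * sigma2 * rho],
         [sigma1 * sigma2 * rho, sigma2**2]]
mean, cov = conditional_normal(mu, sigma, q=1, x2=[4.0])
print(mean, cov)
```

With these values the conditional mean is $\mu_1 + (\sigma_1\rho/\sigma_2)(x_2 - \mu_2) = 2$ and the conditional variance is $\sigma_1^2(1 - \rho^2) = 3$, in agreement with (9).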

А geometrical interpretation of the theory is enlightening. The density 
$f(x_1, x_2)$ can be thought of as a surface $z = f(x_1, x_2)$ over the
$x_1, x_2$-plane. If we intersect this surface with the plane $x_2 = c$, we
obtain a curve $z = f(x_1, c)$ over the line $x_2 = c$ in the
$x_1, x_2$-plane. The ordinate of this curve is proportional to the
conditional density of $X_1$ given $x_2 = c$; that is, it is proportional to
the ordinate of the curve of a univariate normal distribution. In the more
general case it is convenient to consider the ellipsoids of constant density
in the $p$-dimensional space. Then the surfaces of constant density of
$f(x_1, \ldots, x_q \mid c_{q+1}, \ldots, c_p)$ are the intersections of the
surfaces of constant density of $f(x_1, \ldots, x_p)$ and the hyperplanes
$x_{q+1} = c_{q+1}, \ldots, x_p = c_p$; these are again ellipsoids.

Further clarification of these ideas may be had by consideration of an 
actual population which is idealized by a normal distribution. Consider, for 
example, a population of father-son pairs. If the population is reasonably 
homogeneous, the heights of fathers and the heights of corresponding sons 
have approximately a normal distribution (over a certain range). A
conditional distribution may be obtained by considering sons of all fathers
whose height is, say, 5 feet, 9 inches (to the accuracy of measurement); the
heights of these sons will have an approximate univariate normal distribution.
The mean of this normal distribution will differ from the mean of the heights
of sons whose fathers' heights are 5 feet, 4 inches, say, but the variances
will be about the same.

We could also consider triplets of observations, the height of a father, 
height of the oldest son, and height of the next oldest son. The collection of 
heights of two sons given that the fathers' heights are 5 feet, 9 inches is a 
conditional distribution of two variables; the correlation between the heights 
of oldest and next oldest sons is a partial correlation coefficient. Holding the 
fathers’ heights constant eliminates the effect of heredity from fathers; 
however, one would expect that the partial correlation coefficient would be 
positive, since the effect of mothers' heredity and environmental factors 
would tend to cause brothers’ heights to vary similarly. 

As we have remarked above, any conditional distribution obtained from a 
normal distribution is normal with the mean a linear function of the variables 
held fixed and the covariance matrix constant. In the case of nonnormal
distributions the conditional distribution of one set of variates on another
does not usually have these properties. However, one can construct nonnormal
distributions such that some conditional distributions have these properties.
This can be done by taking as the density of $X$ the product
$n[x^{(1)} \mid \mu^{(1)} + B(x^{(2)} - \mu^{(2)}),\ \Sigma_{11\cdot 2}]\, f(x^{(2)})$,
where $f(x^{(2)})$ is an arbitrary density.


2.5.2. The Multiple Correlation Coefficient 


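We again consider $X$ partitioned into $X^{(1)}$ and $X^{(2)}$. We shall study
some properties of $BX^{(2)}$.

Definition 2.5.3. The vector
$X^{(1\cdot 2)} = X^{(1)} - \mu^{(1)} - B(X^{(2)} - \mu^{(2)})$ is the vector of
residuals of $X^{(1)}$ from its regression on $X^{(2)}$.


Theorem 2.5.2. The components of $X^{(1\cdot 2)}$ are uncorrelated with the
components of $X^{(2)}$.


Proof. The vector $X^{(1\cdot 2)}$ is $Y^{(1)} - \mathcal{E}Y^{(1)}$ in (25) of
Section 2.4. ∎


Let $\sigma_{(i)}'$ be the $i$th row of $\Sigma_{12}$, and $\beta_{(i)}'$ the
$i$th row of $B$ [i.e., $\beta_{(i)}' = \sigma_{(i)}'\Sigma_{22}^{-1}$]. Let
$\mathcal{V}(Z)$ be the variance of $Z$.


Theorem 2.5.3. For every vector $\alpha$

(10)  $\mathcal{V}(X_i^{(1\cdot 2)}) \le \mathcal{V}(X_i - \alpha'X^{(2)})$.

Proof. By Theorem 2.5.2

(11)  $\mathcal{V}(X_i - \alpha'X^{(2)})$
      $= \mathcal{E}[X_i - \mu_i - \alpha'(X^{(2)} - \mu^{(2)})]^2$
      $= \mathcal{E}[X_i^{(1\cdot 2)} - \mathcal{E}X_i^{(1\cdot 2)} + (\beta_{(i)} - \alpha)'(X^{(2)} - \mu^{(2)})]^2$
      $= \mathcal{V}[X_i^{(1\cdot 2)}] + (\beta_{(i)} - \alpha)'\,\mathcal{E}(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})'\,(\beta_{(i)} - \alpha)$
      $= \mathcal{V}(X_i^{(1\cdot 2)}) + (\beta_{(i)} - \alpha)'\Sigma_{22}(\beta_{(i)} - \alpha)$.

Since $\Sigma_{22}$ is positive definite, the quadratic form in
$\beta_{(i)} - \alpha$ is nonnegative and attains its minimum of 0 at
$\alpha = \beta_{(i)}$. ∎


Since $\mathcal{E}X^{(1\cdot 2)} = 0$,
$\mathcal{V}(X_i^{(1\cdot 2)}) = \mathcal{E}(X_i^{(1\cdot 2)})^2$. Thus
$\mu_i + \beta_{(i)}'(X^{(2)} - \mu^{(2)})$ is the best linear predictor of
$X_i$ in the sense that of all functions of $X^{(2)}$ of the form
$\alpha'X^{(2)} + c$, the mean squared error of the above is a minimum.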


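Theorem 2.5.4. For every vector $\alpha$

(12)  $\mathrm{Corr}(X_i, \beta_{(i)}'X^{(2)}) \ge \mathrm{Corr}(X_i, \alpha'X^{(2)})$.


Proof. Since the correlation between two variables is unchanged when
either or both is multiplied by a positive constant, we can assume that
$\mathcal{E}[\alpha'(X^{(2)} - \mu^{(2)})]^2 = \mathcal{E}[\beta_{(i)}'(X^{(2)} - \mu^{(2)})]^2$.
Then the expansion of (10) is

(13)  $\sigma_{ii} - 2\mathcal{E}(X_i - \mu_i)\beta_{(i)}'(X^{(2)} - \mu^{(2)}) + \mathcal{V}(\beta_{(i)}'X^{(2)})$
      $\le \sigma_{ii} - 2\mathcal{E}(X_i - \mu_i)\alpha'(X^{(2)} - \mu^{(2)}) + \mathcal{V}(\alpha'X^{(2)})$.

This leads to

(14)  $\dfrac{\mathcal{E}(X_i - \mu_i)\beta_{(i)}'(X^{(2)} - \mu^{(2)})}{\sqrt{\sigma_{ii}\,\mathcal{V}(\beta_{(i)}'X^{(2)})}} \ge \dfrac{\mathcal{E}(X_i - \mu_i)\alpha'(X^{(2)} - \mu^{(2)})}{\sqrt{\sigma_{ii}\,\mathcal{V}(\alpha'X^{(2)})}}$. ∎


Definition 2.5.4. The maximum correlation between $X_i$ and the linear
combination $\alpha'X^{(2)}$ is called the multiple correlation coefficient
between $X_i$ and $X^{(2)}$.


It follows that this is

(15)  $\bar{R}_{i\cdot q+1,\ldots,p} = \dfrac{\sqrt{\beta_{(i)}'\Sigma_{22}\beta_{(i)}}}{\sqrt{\sigma_{ii}}} = \dfrac{\sqrt{\sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}}}{\sqrt{\sigma_{ii}}}$.

A useful formula is

(16)  $1 - \bar{R}^2_{i\cdot q+1,\ldots,p} = \dfrac{\sigma_{ii} - \sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}}{\sigma_{ii}} = \dfrac{|\Sigma_i|}{\sigma_{ii}\,|\Sigma_{22}|}$,

where

(17)  $\Sigma_i = \begin{pmatrix} \sigma_{ii} & \sigma_{(i)}' \\ \sigma_{(i)} & \Sigma_{22} \end{pmatrix}$.

Since

(18)  $\sigma_{ii\cdot q+1,\ldots,p} = \sigma_{ii} - \sigma_{(i)}'\Sigma_{22}^{-1}\sigma_{(i)}$,

it follows that

(19)  $\sigma_{ii\cdot q+1,\ldots,p} = (1 - \bar{R}^2_{i\cdot q+1,\ldots,p})\,\sigma_{ii}$.

This shows incidentally that any partial variance of a component of $X$ cannot
be greater than the variance. In fact, the larger $\bar{R}_{i\cdot q+1,\ldots,p}$
is, the greater the reduction in variance on going to the conditional
distribution. This fact is another reason for considering the multiple
correlation coefficient a measure of association between $X_i$ and $X^{(2)}$.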
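The multiple correlation coefficient and the identity relating it to determinants can be illustrated numerically. The sketch below assumes NumPy; the function names and the covariance matrix are hypothetical choices made only for the illustration. It computes $\beta_{(i)}$ and $\bar{R}$ from a partitioned $\Sigma$, compares $1 - \bar{R}^2$ with $|\Sigma_i|/(\sigma_{ii}|\Sigma_{22}|)$, and evaluates the correlation of $X_i$ with a few other linear combinations $\alpha'X^{(2)}$.

```python
import numpy as np

def multiple_correlation(sigma, i, q):
    """Multiple correlation between X_i (i < q) and X^(2) = components q..p-1."""
    sigma = np.asarray(sigma, dtype=float)
    s_i = sigma[i, q:]                    # sigma_(i): covariances of X_i with X^(2)
    s22 = sigma[q:, q:]
    beta = np.linalg.solve(s22, s_i)      # beta_(i) = Sigma_22^{-1} sigma_(i)
    r_bar = np.sqrt(s_i @ beta / sigma[i, i])
    return r_bar, beta

def corr_with_combination(sigma, i, q, alpha):
    """Correlation between X_i and alpha' X^(2)."""
    sigma = np.asarray(sigma, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    s_i, s22 = sigma[i, q:], sigma[q:, q:]
    return alpha @ s_i / np.sqrt(sigma[i, i] * (alpha @ s22 @ alpha))

# Illustrative positive definite covariance matrix with q = 1, p = 3
sigma = np.array([[4.0, 1.0, 1.2],
                  [1.0, 2.0, 0.5],
                  [1.2, 0.5, 1.0]])
r_bar, beta = multiple_correlation(sigma, i=0, q=1)

# Right-hand side of (16); here Sigma_i is the whole matrix because q = 1
ratio = np.linalg.det(sigma) / (sigma[0, 0] * np.linalg.det(sigma[1:, 1:]))
print(r_bar, 1 - r_bar**2, ratio)

# No other linear combination should exceed r_bar (Theorem 2.5.4)
for alpha in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -2.0]):
    print(alpha, corr_with_combination(sigma, 0, 1, alpha))
```

The correlation attained by $\alpha = \beta_{(i)}$ equals $\bar{R}$; the other trial vectors give smaller values.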

That $\beta_{(i)}'X^{(2)}$ is the best linear predictor of $X_i$ and has the
maximum correlation between $X_i$ and linear functions of $X^{(2)}$ depends
only on the covariance structure, without regard to normality. Even if $X$
does not have a normal distribution, the regression of $X^{(1)}$ on $X^{(2)}$
can be defined by $\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})$;
the residuals can be defined by Definition 2.5.3; and partial covariances and
correlations can be defined as the covariances and correlations of residuals
yielding (3) and (8). Then these quantities do not necessarily have
interpretations in terms of conditional distributions. In the case of
normality $\mu_i + \beta_{(i)}'(x^{(2)} - \mu^{(2)})$ is the conditional
expectation of $X_i$ given $X^{(2)} = x^{(2)}$. Without regard to normality,
$X_i - \mathcal{E}(X_i \mid X^{(2)})$ is uncorrelated with any function of
$X^{(2)}$, $\mathcal{E}(X_i \mid X^{(2)})$ minimizes
$\mathcal{E}[X_i - h(X^{(2)})]^2$ with respect to functions $h(X^{(2)})$ of
$X^{(2)}$, and $\mathcal{E}(X_i \mid X^{(2)})$ maximizes the correlation between
$X_i$ and functions of $X^{(2)}$. (See Problems 2.48 to 2.51.)


2.5.3. Some Formulas for Partial Correlations 


We now consider relations between several conditional distributions obtained 
by holding several different sets of variates fixed. These relations are useful 
because they enable us to compute one set of conditional parameters from 
another set. A very special case is

(20)  $\rho_{12\cdot 3} = \dfrac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{1 - \rho_{13}^2}\,\sqrt{1 - \rho_{23}^2}}$;

this follows from (8) when $p = 3$ and $q = 2$. We shall now find a
generalization of this result. The derivation is tedious, but is given here
for completeness.


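Let

(21)  $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \\ X^{(3)} \end{pmatrix}$,

where $X^{(1)}$ is of $p_1$ components, $X^{(2)}$ of $p_2$ components, and
$X^{(3)}$ of $p_3$ components. Suppose we have the conditional distribution of
$X^{(1)}$ and $X^{(2)}$ given $X^{(3)} = x^{(3)}$; how do we find the
conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ and
$X^{(3)} = x^{(3)}$? We use the fact that the conditional density of $X^{(1)}$
given $X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$ is

(22)  $f(x^{(1)} \mid x^{(2)}, x^{(3)}) = \dfrac{f(x^{(1)}, x^{(2)}, x^{(3)})}{f(x^{(2)}, x^{(3)})}$
      $= \dfrac{f(x^{(1)}, x^{(2)}, x^{(3)})/f(x^{(3)})}{f(x^{(2)}, x^{(3)})/f(x^{(3)})}$
      $= \dfrac{f(x^{(1)}, x^{(2)} \mid x^{(3)})}{f(x^{(2)} \mid x^{(3)})}$.

In the case of normality the conditional covariance matrix of $X^{(1)}$ and
$X^{(2)}$ given $X^{(3)} = x^{(3)}$ is

(23)  $\mathcal{C}\left[\begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} \Big|\; x^{(3)}\right] = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} - \begin{pmatrix} \Sigma_{13} \\ \Sigma_{23} \end{pmatrix} \Sigma_{33}^{-1} \begin{pmatrix} \Sigma_{31} & \Sigma_{32} \end{pmatrix} = \begin{pmatrix} \Sigma_{11\cdot 3} & \Sigma_{12\cdot 3} \\ \Sigma_{21\cdot 3} & \Sigma_{22\cdot 3} \end{pmatrix}$,

say, where

(24)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix}$.

The conditional covariance of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ and
$X^{(3)} = x^{(3)}$ is calculated from the conditional covariances of
$X^{(1)}$ and $X^{(2)}$ given $X^{(3)} = x^{(3)}$ as

(25)  $\mathcal{C}[X^{(1)} \mid x^{(2)}, x^{(3)}] = \Sigma_{11\cdot 3} - \Sigma_{12\cdot 3}(\Sigma_{22\cdot 3})^{-1}\Sigma_{21\cdot 3}$.

This result permits the calculation of $\sigma_{ij\cdot p_1+1,\ldots,p}$,
$i, j = 1, \ldots, p_1$, from $\sigma_{ij\cdot p_1+p_2+1,\ldots,p}$,
$i, j = 1, \ldots, p_1 + p_2$.
In particular, for $p_1 = q$, $p_2 = 1$, and $p_3 = p - q - 1$, we obtain

(26)  $\sigma_{ij\cdot q+1,\ldots,p} = \sigma_{ij\cdot q+2,\ldots,p} - \dfrac{\sigma_{i,q+1\cdot q+2,\ldots,p}\;\sigma_{j,q+1\cdot q+2,\ldots,p}}{\sigma_{q+1,q+1\cdot q+2,\ldots,p}}$,  $i, j = 1, \ldots, q$.

Since

(27)  $\sigma_{ii\cdot q+1,\ldots,p} = \sigma_{ii\cdot q+2,\ldots,p}\,(1 - \rho^2_{i,q+1\cdot q+2,\ldots,p})$,

we obtain

(28)  $\rho_{ij\cdot q+1,\ldots,p} = \dfrac{\rho_{ij\cdot q+2,\ldots,p} - \rho_{i,q+1\cdot q+2,\ldots,p}\;\rho_{j,q+1\cdot q+2,\ldots,p}}{\sqrt{1 - \rho^2_{i,q+1\cdot q+2,\ldots,p}}\,\sqrt{1 - \rho^2_{j,q+1\cdot q+2,\ldots,p}}}$.

This is a useful recursion formula to compute from $\{\rho_{ij}\}$ in
succession $\{\rho_{ij\cdot p}\}$, $\{\rho_{ij\cdot p-1,p}\}$, $\ldots$,
$\rho_{12\cdot 3,\ldots,p}$.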
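The recursion (28) can be compared with the direct matrix computation of partial correlations from $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. The following sketch assumes NumPy; the function names and the test matrix (built from a lower-triangular factor so that it is positive definite) are hypothetical and serve only as an illustration.

```python
import numpy as np

def partial_corr_direct(sigma, q):
    """Partial correlations of the first q variables given the rest,
    from Sigma_11.2 = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21."""
    sigma = np.asarray(sigma, dtype=float)
    s11, s12, s22 = sigma[:q, :q], sigma[:q, q:], sigma[q:, q:]
    s = s11 - s12 @ np.linalg.solve(s22, s12.T)
    d = np.sqrt(np.diag(s))
    return s / np.outer(d, d)

def partial_corr_recursive(rho, q):
    """Apply the recursion (28), conditioning on one more variable per step,
    starting from the simple correlations rho_ij."""
    r = np.array(rho, dtype=float)
    p = r.shape[0]
    for k in range(p - 1, q - 1, -1):     # condition on variable k (0-based)
        head = r[:k, :k]                  # correlations among variables 0..k-1
        c = r[:k, k]                      # their correlations with variable k
        denom = np.sqrt(1.0 - c**2)
        r = (head - np.outer(c, c)) / np.outer(denom, denom)
    return r

# Illustrative covariance matrix, positive definite by construction
m = np.array([[2.0, 0.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.5, 1.0, 2.0, 0.0],
              [0.3, -0.5, 1.0, 2.0]])
sigma = m @ m.T
d = np.sqrt(np.diag(sigma))
rho = sigma / np.outer(d, d)

direct = partial_corr_direct(sigma, q=2)
recursive = partial_corr_recursive(rho, q=2)
print(direct[0, 1], recursive[0, 1])
```

Both routes give the same $\rho_{12\cdot 3,4}$; the recursion needs only the simple correlations, while the direct route inverts $\Sigma_{22}$.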
2.6. THE CHARACTERISTIC FUNCTION; MOMENTS 


2.6.1. The Characteristic Function 


The characteristic function of a multivariate normal distribution has a form 
similar to the density function. From the characteristic function, moments 
and cumulants can be found easily. 


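Definition 2.6.1. The characteristic function of a random vector $X$ is

(1)  $\phi(t) = \mathcal{E}e^{it'X}$

defined for every real vector $t$.


To make this definition meaningful we need to define the expected value
of a complex-valued function of a random vector.


Definition 2.6.2. Let the complex-valued function $g(x)$ be written as
$g(x) = g_1(x) + ig_2(x)$, where $g_1(x)$ and $g_2(x)$ are real-valued. Then the
expected value of $g(X)$ is

(2)  $\mathcal{E}g(X) = \mathcal{E}g_1(X) + i\,\mathcal{E}g_2(X)$.

In particular, since $e^{i\theta} = \cos\theta + i\sin\theta$,

(3)  $\mathcal{E}e^{it'X} = \mathcal{E}\cos t'X + i\,\mathcal{E}\sin t'X$.


To evaluate the characteristic function of a vector $X$, it is often convenient
to use the following lemma:


Lemma 2.6.1. Let $X' = (X^{(1)\prime}\ X^{(2)\prime})$. If $X^{(1)}$ and $X^{(2)}$
are independent and $g(x) = g^{(1)}(x^{(1)})\,g^{(2)}(x^{(2)})$, then

(4)  $\mathcal{E}g(X) = \mathcal{E}g^{(1)}(X^{(1)})\;\mathcal{E}g^{(2)}(X^{(2)})$.

Proof. If $g(x)$ is real-valued and $X$ has a density,

(5)  $\mathcal{E}g(X) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(x)f(x)\,dx_1\cdots dx_p$
     $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g^{(1)}(x^{(1)})\,g^{(2)}(x^{(2)})\,f^{(1)}(x^{(1)})\,f^{(2)}(x^{(2)})\,dx_1\cdots dx_p$
     $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g^{(1)}(x^{(1)})f^{(1)}(x^{(1)})\,dx_1\cdots dx_q \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g^{(2)}(x^{(2)})f^{(2)}(x^{(2)})\,dx_{q+1}\cdots dx_p$
     $= \mathcal{E}g^{(1)}(X^{(1)})\;\mathcal{E}g^{(2)}(X^{(2)})$.

If $g(x)$ is complex-valued,

(6)  $g(x) = [g_1^{(1)}(x^{(1)}) + ig_2^{(1)}(x^{(1)})]\,[g_1^{(2)}(x^{(2)}) + ig_2^{(2)}(x^{(2)})]$
     $= g_1^{(1)}(x^{(1)})g_1^{(2)}(x^{(2)}) - g_2^{(1)}(x^{(1)})g_2^{(2)}(x^{(2)})$
     $\quad + i[g_2^{(1)}(x^{(1)})g_1^{(2)}(x^{(2)}) + g_1^{(1)}(x^{(1)})g_2^{(2)}(x^{(2)})]$.

Then

(7)  $\mathcal{E}g(X) = \mathcal{E}[g_1^{(1)}(X^{(1)})g_1^{(2)}(X^{(2)}) - g_2^{(1)}(X^{(1)})g_2^{(2)}(X^{(2)})]$
     $\quad + i\,\mathcal{E}[g_2^{(1)}(X^{(1)})g_1^{(2)}(X^{(2)}) + g_1^{(1)}(X^{(1)})g_2^{(2)}(X^{(2)})]$
     $= \mathcal{E}g_1^{(1)}(X^{(1)})\,\mathcal{E}g_1^{(2)}(X^{(2)}) - \mathcal{E}g_2^{(1)}(X^{(1)})\,\mathcal{E}g_2^{(2)}(X^{(2)})$
     $\quad + i[\mathcal{E}g_2^{(1)}(X^{(1)})\,\mathcal{E}g_1^{(2)}(X^{(2)}) + \mathcal{E}g_1^{(1)}(X^{(1)})\,\mathcal{E}g_2^{(2)}(X^{(2)})]$
     $= [\mathcal{E}g_1^{(1)}(X^{(1)}) + i\,\mathcal{E}g_2^{(1)}(X^{(1)})]\,[\mathcal{E}g_1^{(2)}(X^{(2)}) + i\,\mathcal{E}g_2^{(2)}(X^{(2)})]$
     $= \mathcal{E}g^{(1)}(X^{(1)})\;\mathcal{E}g^{(2)}(X^{(2)})$. ∎

By applying Lemma 2.6.1 successively to $g(X) = e^{it'X}$, we derive


Lemma 2.6.2. If the components of $X$ are mutually independent,

(8)  $\mathcal{E}e^{it'X} = \prod_{j=1}^{p} \mathcal{E}e^{it_jX_j}$.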


We now find the characteristic function of a random vector with a normal
distribution.

Theorem 2.6.1. The characteristic function of $X$ distributed according to
$N(\mu, \Sigma)$ is

(9)  $\phi(t) = \mathcal{E}e^{it'X} = e^{it'\mu - \frac{1}{2}t'\Sigma t}$

for every real vector $t$.


Proof. From Corollary A.1.6 of the Appendix we know there is a nonsingular
matrix $C$ such that

(10)  $C'\Sigma^{-1}C = I$.

Thus

(11)  $\Sigma^{-1} = (C')^{-1}C^{-1} = (CC')^{-1}$.

Let

(12)  $X - \mu = CY$.

Then $Y$ is distributed according to $N(0, I)$.
Now the characteristic function of $Y$ is

(13)  $\psi(u) = \mathcal{E}e^{iu'Y} = \prod_{j=1}^{p}\mathcal{E}e^{iu_jY_j}$.

Since $Y_j$ is distributed according to $N(0, 1)$,

(14)  $\psi(u) = \prod_{j=1}^{p} e^{-\frac{1}{2}u_j^2} = e^{-\frac{1}{2}u'u}$.

Thus

(15)  $\phi(t) = \mathcal{E}e^{it'X} = \mathcal{E}e^{it'(CY + \mu)}$
      $= e^{it'\mu}\,\mathcal{E}e^{it'CY}$
      $= e^{it'\mu}\,e^{-\frac{1}{2}(t'C)(t'C)'}$

for $t'C = u'$; the third equality is verified by writing both sides of it as
integrals. But this is

(16)  $\phi(t) = e^{it'\mu}\,e^{-\frac{1}{2}t'CC't}$
      $= e^{it'\mu - \frac{1}{2}t'\Sigma t}$

by (11). This proves the theorem. ∎

The characteristic function of the normal distribution is very useful. For
example, we can use this method of proof to demonstrate the results of
Section 2.4. If $Z = DX$, then the characteristic function of $Z$ is

(17)  $\mathcal{E}e^{it'Z} = \mathcal{E}e^{it'DX} = \mathcal{E}e^{i(D't)'X}$
      $= e^{i(D't)'\mu - \frac{1}{2}(D't)'\Sigma(D't)}$
      $= e^{it'(D\mu) - \frac{1}{2}t'(D\Sigma D')t}$,

which is the characteristic function of $N(D\mu, D\Sigma D')$ (by Theorem 2.6.1).

It is interesting to use the characteristic function to show that it is only the
multivariate normal distribution that has the property that every linear
combination of variates is normally distributed. Consider a vector $Y$ of $p$
components with density $f(y)$ and characteristic function

(18)  $\psi(u) = \mathcal{E}e^{iu'Y} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{iu'y}f(y)\,dy_1\cdots dy_p$,

and suppose the mean of $Y$ is $\mu$ and the covariance matrix is $\Sigma$.
Suppose $u'Y$ is normally distributed for every $u$. Then the characteristic
function of such linear combination is

(19)  $\mathcal{E}e^{itu'Y} = e^{itu'\mu - \frac{1}{2}t^2u'\Sigma u}$.

Now set $t = 1$. Since the right-hand side is then the characteristic function of
$N(\mu, \Sigma)$, the result is proved (by Theorem 2.6.1 above and 2.6.3 below).


Theorem 2.6.2. If every linear combination of the components of a vector $Y$
is normally distributed, then $Y$ is normally distributed.


It might be pointed out in passing that it is essential that every linear
combination be normally distributed for Theorem 2.6.2 to hold. For instance,
if $Y = (Y_1, Y_2)'$ and $Y_1$ and $Y_2$ are not independent, then $Y_1$ and
$Y_2$ can each have a marginal normal distribution. An example is most easily
given geometrically. Let $X_1$, $X_2$ have a joint normal distribution with
means 0. Move the same mass in Figure 2.1 from rectangle A to C and from B to
D. It will be seen that the resulting distribution of $Y$ is such that the
marginal distributions of $Y_1$ and $Y_2$ are the same as $X_1$ and $X_2$,
respectively, which are normal, and yet the joint distribution of $Y_1$ and
$Y_2$ is not normal.

Figure 2.1

This example can be used also to demonstrate that two variables, $Y_1$ and
$Y_2$, can be uncorrelated and the marginal distribution of each may be normal,
but the pair need not have a joint normal distribution and need not be
independent. This is done by choosing the rectangles so that for the resultant
distribution the expected value of $Y_1Y_2$ is zero. It is clear geometrically
that this can be done.

For future reference we state two useful theorems concerning characteris- 
tic functions. 


Theorem 2.6.3. If the random vector $X$ has the density $f(x)$ and the
characteristic function $\phi(t)$, then

(20)  $f(x) = \dfrac{1}{(2\pi)^p}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-it'x}\phi(t)\,dt_1\cdots dt_p$.


This shows that the characteristic function determines the density function
uniquely. If $X$ does not have a density, the characteristic function uniquely
defines the probability of any continuity interval. In the univariate case a
continuity interval is an interval such that the cdf does not have a
discontinuity at an endpoint of the interval.


Theorem 2.6.4. Let $\{F_j(x)\}$ be a sequence of cdfs, and let $\{\phi_j(t)\}$
be the sequence of corresponding characteristic functions. A necessary and
sufficient condition for $F_j(x)$ to converge to a cdf $F(x)$ is that, for every
$t$, $\phi_j(t)$ converges to a limit $\phi(t)$ that is continuous at $t = 0$.
When this condition is satisfied, the limit $\phi(t)$ is identical with the
characteristic function of the limiting distribution $F(x)$.


For the proofs of these two theorems, the reader is referred to Cramér
(1946), Sections 10.6 and 10.7.

2.6.2. The Moments and Cumulants


The moments of $X_1, \ldots, X_p$ with a joint normal distribution can be
obtained from the characteristic function (9). The mean is

(21)  $\mathcal{E}X_h = \dfrac{1}{i}\dfrac{\partial\phi}{\partial t_h}\Big|_{t=0}$
      $= \dfrac{1}{i}\Big\{-\sum_j \sigma_{hj}t_j + i\mu_h\Big\}\phi(t)\Big|_{t=0}$
      $= \mu_h$.

The second moment is

(22)  $\mathcal{E}X_hX_j = \dfrac{1}{i^2}\dfrac{\partial^2\phi}{\partial t_h\,\partial t_j}\Big|_{t=0}$
      $= \dfrac{1}{i^2}\Big\{\Big(-\sum_k \sigma_{hk}t_k + i\mu_h\Big)\Big(-\sum_k \sigma_{kj}t_k + i\mu_j\Big) - \sigma_{hj}\Big\}\phi(t)\Big|_{t=0}$
      $= \sigma_{hj} + \mu_h\mu_j$.

Thus

(23)  $\mathrm{Variance}(X_i) = \mathcal{E}(X_i - \mu_i)^2 = \sigma_{ii}$,

(24)  $\mathrm{Covariance}(X_i, X_j) = \mathcal{E}(X_i - \mu_i)(X_j - \mu_j) = \sigma_{ij}$.

Any third moment about the mean is

(25)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = 0$.

The fourth moment about the mean is

(26)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}$.

Every moment of odd order is 0.
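The way moments are read off from derivatives of the characteristic function can be illustrated with finite differences. This is a rough numerical sketch, assuming NumPy; the function names, the parameter values, and the step sizes are arbitrary illustrative choices. It differentiates $\phi(t) = e^{it'\mu - \frac{1}{2}t'\Sigma t}$ at $t = 0$ and recovers $\mu_h$ and $\sigma_{hj} + \mu_h\mu_j$.

```python
import numpy as np

# Illustrative parameters of a bivariate normal distribution
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

def phi(t):
    """Characteristic function (9) of N(mu, sigma)."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * (t @ mu) - 0.5 * (t @ sigma @ t))

def first_moment(h, eps=1e-4):
    """(1/i) * d phi / d t_h at t = 0, by a central difference."""
    e = np.zeros(2)
    e[h] = eps
    return ((phi(e) - phi(-e)) / (2 * eps) / 1j).real

def second_moment(h, j, eps=1e-3):
    """(1/i^2) * d^2 phi / (d t_h d t_j) at t = 0, by central differences."""
    eh = np.zeros(2)
    eh[h] = eps
    ej = np.zeros(2)
    ej[j] = eps
    d2 = (phi(eh + ej) - phi(eh - ej) - phi(-eh + ej) + phi(-eh - ej)) / (4 * eps**2)
    return (d2 / 1j**2).real

print(first_moment(0), first_moment(1))   # approximately mu_1, mu_2
print(second_moment(0, 1))                # approximately sigma_12 + mu_1 * mu_2
print(second_moment(0, 0))                # approximately sigma_11 + mu_1^2
```

The finite-difference values agree with (21) and (22) up to discretization error; subtracting $\mu_h\mu_j$ gives the covariances in (23) and (24).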


Definition 2.6.3. If all the moments of a distribution exist, then the
cumulants are the coefficients $\kappa$ in

(27)  $\log\phi(t) = \sum_{s_1,\ldots,s_p=0}^{\infty} \kappa_{s_1\cdots s_p}\,\dfrac{(it_1)^{s_1}\cdots(it_p)^{s_p}}{s_1!\cdots s_p!}$.

In the case of the multivariate normal distribution
$\kappa_{10\cdots 0} = \mu_1, \ldots, \kappa_{0\cdots 01} = \mu_p$,
$\kappa_{20\cdots 0} = \sigma_{11}, \ldots, \kappa_{0\cdots 02} = \sigma_{pp}$,
$\kappa_{110\cdots 0} = \sigma_{12}, \ldots$. The cumulants for which
$\sum s_i > 2$ are 0.

2.7. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


2.7.1. Spherically and Elliptically Contoured Distributions 


It was noted at the end of Section 2.3 that the density of the multivariate
normal distribution with mean $\mu$ and covariance matrix $\Sigma$ is constant
on concentric ellipsoids

(1)  $(x - \mu)'\Sigma^{-1}(x - \mu) = k$.

A general class of distributions with this property is the class of
elliptically contoured distributions with density

(2)  $|\Lambda|^{-\frac{1}{2}}\,g[(x - \nu)'\Lambda^{-1}(x - \nu)]$,

where $\Lambda$ is a positive definite matrix, $g(\cdot) \ge 0$, and

(3)  $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y'y)\,dy_1\cdots dy_p = 1$.

If $C$ is a nonsingular matrix such that $C'\Lambda^{-1}C = I$, the
transformation $x - \nu = Cy$ carries the density (2) to the density $g(y'y)$.
The contours of constant density of $g(y'y)$ are spheres centered at the
origin. The class of such densities is known as the spherically contoured
distributions. Elliptically contoured distributions do not necessarily have
densities, but in this exposition only distributions with densities will be
treated for statistical inference.

А spherically contoured density can be expressed in polar coordinates by 
the transformation 


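(4)  $y_1 = r\sin\theta_1$,
     $y_2 = r\cos\theta_1\sin\theta_2$,
     $y_3 = r\cos\theta_1\cos\theta_2\sin\theta_3$,
     $\vdots$
     $y_{p-1} = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{p-2}\sin\theta_{p-1}$,
     $y_p = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{p-2}\cos\theta_{p-1}$,

where $-\frac{1}{2}\pi < \theta_i \le \frac{1}{2}\pi$, $i = 1, \ldots, p-2$,
$-\pi < \theta_{p-1} \le \pi$, and $0 \le r < \infty$.
Note that $y'y = r^2$. The Jacobian of the transformation (4) is
$r^{p-1}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}$. See
Problem 7.1. If $g(y'y)$ is the density of $Y$, then the density of
$R, \Theta_1, \ldots, \Theta_{p-1}$ is

(5)  $r^{p-1}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}\;g(r^2)$.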
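The polar transformation (4) and its Jacobian can be checked numerically. The sketch below assumes NumPy; the function names and the test point are hypothetical. It maps $(r, \theta_1, \ldots, \theta_{p-1})$ to $y$, confirms $y'y = r^2$, and compares a finite-difference Jacobian determinant with $r^{p-1}\cos^{p-2}\theta_1\cdots\cos\theta_{p-2}$.

```python
import numpy as np

def polar_to_cartesian(r, theta):
    """Transformation (4): (r, theta_1..theta_{p-1}) -> (y_1..y_p)."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size + 1
    y = np.empty(p)
    cos_prod = 1.0                        # running product cos(theta_1)...cos(theta_i)
    for i in range(p - 1):
        y[i] = r * cos_prod * np.sin(theta[i])
        cos_prod *= np.cos(theta[i])
    y[p - 1] = r * cos_prod
    return y

def jacobian_formula(r, theta):
    """r^(p-1) cos^(p-2)(theta_1) cos^(p-3)(theta_2) ... cos(theta_{p-2})."""
    p = len(theta) + 1
    value = r ** (p - 1)
    for i in range(p - 2):
        value *= np.cos(theta[i]) ** (p - 2 - i)
    return value

def numerical_jacobian(r, theta, eps=1e-6):
    """Absolute determinant of dy/d(r, theta) by central differences."""
    x0 = np.concatenate(([r], np.asarray(theta, dtype=float)))
    p = x0.size
    jac = np.empty((p, p))
    for k in range(p):
        step = np.zeros(p)
        step[k] = eps
        hi = polar_to_cartesian((x0 + step)[0], (x0 + step)[1:])
        lo = polar_to_cartesian((x0 - step)[0], (x0 - step)[1:])
        jac[:, k] = (hi - lo) / (2 * eps)
    return abs(np.linalg.det(jac))

r, theta = 1.7, [0.3, -0.4, 2.0]          # p = 4; theta_3 ranges over (-pi, pi]
y = polar_to_cartesian(r, theta)
print(y @ y, r**2)
print(numerical_jacobian(r, theta), jacobian_formula(r, theta))
```

The last angle does not enter the Jacobian, which is why its marginal distribution is uniform over $(-\pi, \pi)$.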


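Note that $R, \Theta_1, \ldots, \Theta_{p-1}$ are independently distributed.
Since

(6)  $\int_{-\pi/2}^{\pi/2}\cos^{h-1}\theta\,d\theta = \dfrac{\Gamma(\frac{1}{2}h)\,\Gamma(\frac{1}{2})}{\Gamma[\frac{1}{2}(h+1)]}$

(Problem 7.2), the marginal density of $R$ is

(7)  $C(p)\,g(r^2)\,r^{p-1}$,

where

(8)  $C(p) = \dfrac{2\pi^{\frac{1}{2}p}}{\Gamma(\frac{1}{2}p)}$
     $= \int_{-\pi}^{\pi}\int_{-\pi/2}^{\pi/2}\cdots\int_{-\pi/2}^{\pi/2}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}\,d\theta_1\cdots d\theta_{p-2}\,d\theta_{p-1}$.

The marginal density of $\Theta_i$ is
$\Gamma[\frac{1}{2}(p - i + 1)]\cos^{p-i-1}\theta_i\,/\,\{\Gamma(\frac{1}{2})\,\Gamma[\frac{1}{2}(p - i)]\}$,
$i = 1, \ldots, p-2$, and of $\Theta_{p-1}$ is $1/(2\pi)$.
In the normal case of $N(0, I)$ the density of $Y$ is

     $g(y'y) = (2\pi)^{-\frac{1}{2}p}\exp(-\tfrac{1}{2}y'y)$,

and the density of $R = (Y'Y)^{\frac{1}{2}}$ is
$r^{p-1}\exp(-\frac{1}{2}r^2)/[2^{\frac{1}{2}p-1}\Gamma(\frac{1}{2}p)]$. The
density of $r^2 = v$ is
$v^{\frac{1}{2}p-1}e^{-\frac{1}{2}v}/[2^{\frac{1}{2}p}\Gamma(\frac{1}{2}p)]$.
This is the $\chi^2$-density with $p$ degrees of freedom.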

The constant $C(p)$ is the surface area of a sphere of unit radius in $p$
dimensions. The random vector $U$ with coordinates $\sin\Theta_1$,
$\cos\Theta_1\sin\Theta_2, \ldots, \cos\Theta_1\cos\Theta_2\cdots\cos\Theta_{p-1}$,
where $\Theta_1, \ldots, \Theta_{p-1}$ are independently distributed each with
the uniform distribution over $(-\pi/2, \pi/2)$ except for $\Theta_{p-1}$
having the uniform distribution over $(-\pi, \pi)$, is said to be uniformly
distributed on the unit sphere. (This is the simplest example of a spherically
contoured distribution not having a density.) A stochastic representation of
$Y$ with density $g(y'y)$ is

(9)  $Y \stackrel{d}{=} RU$,

where $R$ has the density (7).
Since each of the densities of $\Theta_1, \ldots, \Theta_{p-1}$ are even,

(10)  $\mathcal{E}U = 0$.

Because $R$ and $U$ are independent,

(11)  $\mathcal{E}Y = 0$

if $\mathcal{E}R < \infty$. Further,

(12)  $\mathcal{E}YY' = \mathcal{E}R^2\;\mathcal{E}UU'$

if $\mathcal{E}R^2 < \infty$. By symmetry
$\mathcal{E}U_1^2 = \cdots = \mathcal{E}U_p^2 = 1/p$ because
$\sum_{i=1}^{p} U_i^2 = 1$.
Again by symmetry
$\mathcal{E}U_1U_2 = \mathcal{E}U_1U_3 = \cdots = \mathcal{E}U_{p-1}U_p$. In
particular $\mathcal{E}U_1U_2 = \mathcal{E}\sin\Theta_1\cos\Theta_1\sin\Theta_2$,
the integrand of which is an odd function of $\theta_1$ and of $\theta_2$.
Hence, $\mathcal{E}U_iU_j = 0$, $i \ne j$. To summarize,

(13)  $\mathcal{E}UU' = (1/p)I_p$,

and

(14)  $\mathcal{E}YY' = (1/p)\,\mathcal{E}R^2\,I_p$

(if $\mathcal{E}R^2 < \infty$).
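The second-moment results (13) and (14) can be illustrated by simulation. The sketch below is a Monte Carlo check under stated assumptions, not a derivation: it assumes NumPy, generates points uniformly on the unit sphere by normalizing standard normal vectors (a standard construction, not the angular one used in the text), and takes $R$ uniform on $(0, 2)$ as an arbitrary radial distribution, so that $\mathcal{E}R^2 = 4/3$.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed so the run is reproducible
p, n = 3, 200_000

# Uniform points on the unit sphere: normalize N(0, I) vectors (assumed construction)
z = rng.standard_normal((n, p))
u = z / np.linalg.norm(z, axis=1, keepdims=True)

# Arbitrary radial part, independent of U: R ~ Uniform(0, 2), so E R^2 = 4/3
r = rng.uniform(0.0, 2.0, size=n)
y = r[:, None] * u                        # stochastic representation Y = R U

second_moment_u = u.T @ u / n             # estimates E U U'
second_moment_y = y.T @ y / n             # estimates E Y Y'
expected_r2 = 4.0 / 3.0

print(np.round(second_moment_u, 3))       # close to I / p
print(np.round(second_moment_y, 3))       # close to (E R^2 / p) I
```

Up to sampling error the two matrices are $(1/p)I_p$ and $(1/p)\,\mathcal{E}R^2\,I_p$, as in (13) and (14).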


The distinguishing characteristic of the class of spherically contoured
distributions is that $OY \stackrel{d}{=} Y$ for every orthogonal matrix $O$.


Theorem 2.7.1. If $Y$ has the density $g(y'y)$, then $Z = OY$, where
$O'O = I$, has the density $g(z'z)$.


Proof. The transformation $z = Oy$ has Jacobian 1. ∎


We shall extend the definition of $Y$ being spherically contoured to any
distribution with the property $OY \stackrel{d}{=} Y$.


Corollary 2.7.1. If $Y$ is spherically contoured with stochastic representation
$Y \stackrel{d}{=} RU$ with $R^2 = Y'Y$, then $U$ is spherically contoured.


Proof. If $Z = OY$ and hence $Z \stackrel{d}{=} Y$, and $Z$ has the stochastic
representation $Z = SV$, where $S^2 = Z'Z$, then $S = R$ and
$V = OU \stackrel{d}{=} U$. ∎

The density of $X = \nu + CY$ is (2). From (11) and (14) we derive the
following theorem:


Theorem 2.7.2. If $X$ has the density (2) and $\mathcal{E}R^2 < \infty$,

(15)  $\mathcal{E}X = \mu = \nu$,  $\mathcal{C}(X) = \mathcal{E}(X - \mu)(X - \mu)' = \Sigma = (1/p)\,\mathcal{E}R^2\,\Lambda$.


In fact if $\mathcal{E}R^m < \infty$, a moment of $X$ of order $h$ ($\le m$) is
$\mathcal{E}(X_1 - \mu_1)^{h_1}\cdots(X_p - \mu_p)^{h_p} = \mathcal{E}Z_1^{h_1}\cdots Z_p^{h_p}\;\mathcal{E}R^h/\mathcal{E}(\chi_p^2)^{\frac{1}{2}h}$,
where $Z$ has the distribution $N(0, \Lambda)$ and $h = h_1 + \cdots + h_p$.


Theorem 2.7.3. If $X$ has the density (2), $\mathcal{E}R^2 < \infty$, and
$f[c\,\mathcal{C}(X)] = f[\mathcal{C}(X)]$ for all $c > 0$, then
$f[\mathcal{C}(X)] = f(\Lambda)$.

In particular
$\rho_{ij}(X) = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}}$,
where $\Sigma = (\sigma_{ij})$ and $\Lambda = (\lambda_{ij})$.

2.7.2. Distributions of Linear Combinations; Marginal Distributions 


First we consider a spherically contoured distribution with density $g(y'y)$.
Let $y' = (y_1', y_2')$, where $y_1$ and $y_2$ have $q$ and $p - q$ components,
respectively. The marginal density of $y_2$ is

(16)  $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y_1'y_1 + y_2'y_2)\,dy_1$,

the integration being over the $q$ components of $y_1$.
Express $y_1$ in polar coordinates (4) with $r$ replaced by $r_1$ and $p$
replaced by $q$. Then the marginal density of $y_2$ is

(17)  $g_2(y_2'y_2) = C(q)\int_0^{\infty} g(r_1^2 + y_2'y_2)\,r_1^{q-1}\,dr_1$.

This expression shows that the marginal distribution of $Y_2$ has a density
which is spherically contoured.
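Now consider a vector $X' = (X^{(1)\prime}, X^{(2)\prime})$ with density (2). If
$\mathcal{E}R^2 < \infty$, the covariance matrix of $X$ is (15) partitioned as
(14) of Section 2.4.

Let $Z^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)} = X^{(1)} - \Lambda_{12}\Lambda_{22}^{-1}X^{(2)}$,
$Z^{(2)} = X^{(2)}$,
$\tau^{(1)} = \nu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\nu^{(2)} = \nu^{(1)} - \Lambda_{12}\Lambda_{22}^{-1}\nu^{(2)}$,
$\tau^{(2)} = \nu^{(2)}$. Then the density of $Z' = (Z^{(1)\prime}, Z^{(2)\prime})$ is

(18)  $|\Lambda_{11\cdot 2}|^{-\frac{1}{2}}\,|\Lambda_{22}|^{-\frac{1}{2}}\;g[(z^{(1)} - \tau^{(1)})'\Lambda_{11\cdot 2}^{-1}(z^{(1)} - \tau^{(1)}) + (z^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(z^{(2)} - \nu^{(2)})]$.

Note that $Z^{(1)}$ and $Z^{(2)}$ are uncorrelated even though possibly
dependent. Let $C_1$ and $C_2$ be $q \times q$ and $(p - q) \times (p - q)$
matrices satisfying $C_1'\Lambda_{11\cdot 2}^{-1}C_1 = I_q$ and
$C_2'\Lambda_{22}^{-1}C_2 = I_{p-q}$. Define $y^{(1)}$ and $y^{(2)}$ by
$z^{(1)} - \tau^{(1)} = C_1y^{(1)}$ and $z^{(2)} - \nu^{(2)} = C_2y^{(2)}$. Then
$Y^{(1)}$ and $Y^{(2)}$ have the density
$g(y^{(1)\prime}y^{(1)} + y^{(2)\prime}y^{(2)})$.
The marginal density of $Y^{(2)}$ is (17), and the marginal density of
$X^{(2)} = Z^{(2)}$ is

(19)  $|\Lambda_{22}|^{-\frac{1}{2}}\,g_2[(x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})]$
      $= C(q)\,|\Lambda_{22}|^{-\frac{1}{2}}\int_0^{\infty} g[r_1^2 + (x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})]\,r_1^{q-1}\,dr_1$.

The moments of $Y_2$ can be calculated from the moments of $Y$.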

The generalization of Theorem 2.4.1 to elliptically contoured distributions
is the following: Let $X$ with $p$ components have the density (2). Then
$Y = CX$ has the density
$|C\Lambda C'|^{-\frac{1}{2}}\,g[(y - C\nu)'(C\Lambda C')^{-1}(y - C\nu)]$ for
$C$ nonsingular.

The generalization of Theorem 2.4.4 is the following: If $X$ has the density
(2), then $Z = DX$ has the density

(20)  $|D\Lambda D'|^{-\frac{1}{2}}\,g_2[(z - D\nu)'(D\Lambda D')^{-1}(z - D\nu)]$,

where $D$ is a $q \times p$ matrix of rank $q \le p$ and $g_2$ is given by (17).
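We can also characterize marginal distributions in terms of the
representation (9). Consider

(21)  $Y = \begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} \stackrel{d}{=} RU = R\begin{pmatrix} U^{(1)} \\ U^{(2)} \end{pmatrix}$,

where $Y^{(1)}$ and $U^{(1)}$ have $q$ components and $Y^{(2)}$ and $U^{(2)}$
have $p - q$ components. Then $R_2^2 = Y^{(2)\prime}Y^{(2)}$ has the distribution
of $R^2\,U^{(2)\prime}U^{(2)}$, and

(22)  $U^{(2)\prime}U^{(2)} = \dfrac{U^{(2)\prime}U^{(2)}}{U'U} \stackrel{d}{=} \dfrac{Y^{(2)\prime}Y^{(2)}}{Y'Y}$.

In the case $Y \sim N(0, I_p)$, (22) has the beta distribution, say
$B(p - q, q)$, with density

(23)  $\dfrac{\Gamma(\frac{1}{2}p)}{\Gamma(\frac{1}{2}q)\,\Gamma[\frac{1}{2}(p - q)]}\;z^{\frac{1}{2}(p-q)-1}(1 - z)^{\frac{1}{2}q-1}$,  $0 \le z \le 1$.

Hence, in general,

(24)  $Y^{(2)} \stackrel{d}{=} R_2V$,

where $R_2^2 \stackrel{d}{=} R^2b$, $b \sim B(p - q, q)$, $V$ has the uniform
distribution of $v'v = 1$ in $p_2$ ($= p - q$) dimensions, and $R^2$, $b$, and
$V$ are independent. All marginal distributions are elliptically contoured.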


2.7.3. Conditional Distributions and Multiple Correlation Coefficient 


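The density of the conditional distribution of $Y_1$ given $Y_2 = y_2$ when
$Y = (Y_1', Y_2')'$ has the spherical density $g(y'y)$ is

(25)  $\dfrac{g(y_1'y_1 + y_2'y_2)}{g_2(y_2'y_2)} = \dfrac{g(y_1'y_1 + r_2^2)}{g_2(r_2^2)}$,

where the marginal density $g_2(y_2'y_2)$ is given by (17) and
$r_2^2 = y_2'y_2$. In terms of $y_1$, (25) is a spherically contoured
distribution (depending on $r_2^2$).

Now consider $X = (X^{(1)\prime}, X^{(2)\prime})'$ with density (2). The
conditional density of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

(26)  $|\Lambda_{11\cdot 2}|^{-\frac{1}{2}}\,g\{[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})]'\Lambda_{11\cdot 2}^{-1}[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})]$
      $\qquad + (x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})\}$
      $\qquad \div\; g_2[(x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})]$
      $= |\Lambda_{11\cdot 2}|^{-\frac{1}{2}}\,g\{[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})]'\Lambda_{11\cdot 2}^{-1}[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})] + r_2^2\}$
      $\qquad \div\; g_2(r_2^2)$,

where $r_2^2 = (x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})$ and
$B = \Lambda_{12}\Lambda_{22}^{-1}$. The density (26) is elliptically contoured
in $x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})$ as a function of $x^{(1)}$.
The conditional mean of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

(27)  $\mathcal{E}(X^{(1)} \mid x^{(2)}) = \nu^{(1)} + B(x^{(2)} - \nu^{(2)})$

if $\mathcal{E}(R_1^2 \mid Y_2'Y_2 = r_2^2) < \infty$ in (25), where
$R_1^2 = Y_1'Y_1$. Also the conditional covariance matrix is
$[\mathcal{E}(R_1^2 \mid r_2^2)/q]\,\Lambda_{11\cdot 2}$. It follows that
Definition 2.5.2 of the partial correlation coefficient holds when
$(\sigma_{ij\cdot q+1,\ldots,p}) = \Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$
and $\Sigma$ is the parameter matrix given above.

Theorems 2.5.2, 2.5.3, and 2.5.4 are true for any elliptically contoured
distribution for which $\mathcal{E}R^2 < \infty$.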


2.7.4. The Characteristic Function; Moments 


The characteristic function of a random vector $Y$ with a spherically
contoured distribution $\mathcal{E}e^{it'Y}$ has the property of invariance over
orthogonal transformations, that is,

(28)  $\mathcal{E}e^{it'OY} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'Oy}\,g(y'y)\,dy_1\cdots dy_p$
      $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'z}\,g(z'z)\,dz_1\cdots dz_p$
      $= \mathcal{E}e^{it'Z}$,

where $Z = OY$ also has the density $g(y'y)$. The equality (28) for all
orthogonal $O$ implies $\mathcal{E}e^{it'Z}$ is a function of $t't$. We write

(29)  $\mathcal{E}e^{it'Y} = \phi(t't)$.

Then for $X = \mu + CY$

(30)  $\mathcal{E}e^{it'X} = e^{it'\mu}\,\mathcal{E}e^{it'CY}$
      $= e^{it'\mu}\,\phi(t'CC't)$
      $= e^{it'\mu}\,\phi(t'\Lambda t)$

when $\Lambda = CC'$. Conversely, any characteristic function of the form
$e^{it'\mu}\phi(t'\Lambda t)$ corresponding to a density corresponds to a random
vector $X$ with the density (2).

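The moments of $X$ with an elliptically contoured distribution can be
found from the characteristic function $e^{it'\mu}\phi(t'\Lambda t)$ or from the
representation $X = \mu + RCU$, where $C'\Lambda^{-1}C = I$. Note that

(31)  $\mathcal{E}R^2 = C(p)\int_0^{\infty} r^{p+1}g(r^2)\,dr = -2p\,\phi'(0)$,

(32)  $\mathcal{E}R^4 = C(p)\int_0^{\infty} r^{p+3}g(r^2)\,dr = 4p(p+2)\,\phi''(0)$.

Consider the higher-order moments of $Y = RU$. The odd-order moments
of $U$ are 0, and hence the odd-order moments of $Y$ are 0.
We have

(33)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = 0$.

In fact, all moments of $X - \mu$ of odd order are 0.
Consider $\mathcal{E}U_iU_jU_kU_l$. Because $U'U = 1$,

(34)  $1 = \sum_{i,j=1}^{p}\mathcal{E}U_i^2U_j^2 = p\,\mathcal{E}U_1^4 + p(p-1)\,\mathcal{E}U_1^2U_2^2$.

Integration of $\mathcal{E}\sin^4\Theta_1$ gives
$\mathcal{E}U_1^4 = 3/[p(p+2)]$; then (34) implies
$\mathcal{E}U_1^2U_2^2 = 1/[p(p+2)]$. Hence
$\mathcal{E}Y_i^4 = 3\,\mathcal{E}R^4/[p(p+2)]$ and
$\mathcal{E}Y_i^2Y_j^2 = \mathcal{E}R^4/[p(p+2)]$, $i \ne j$. Unless
$i = j = k = l$ or $i = j \ne k = l$ or $i = k \ne j = l$ or
$i = l \ne j = k$, we have $\mathcal{E}U_iU_jU_kU_l = 0$. To summarize,
$\mathcal{E}U_iU_jU_kU_l = (\delta_{ij}\delta_{kl} + \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk})/[p(p+2)]$.
The fourth-order moments of $X$ are

(35)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l)$
      $= \dfrac{\mathcal{E}R^4}{p(p+2)}\,(\lambda_{ij}\lambda_{kl} + \lambda_{ik}\lambda_{jl} + \lambda_{il}\lambda_{jk})$
      $= \dfrac{\mathcal{E}R^4}{(\mathcal{E}R^2)^2}\,\dfrac{p}{p+2}\,(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$.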

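The fourth cumulant of the $i$th component of $X$ standardized by its
standard deviation is

(36)  $\dfrac{\mathcal{E}(X_i - \mu_i)^4}{[\mathcal{E}(X_i - \mu_i)^2]^2} - 3 = \dfrac{3\,\mathcal{E}R^4\,\lambda_{ii}^2/[p(p+2)]}{[(\mathcal{E}R^2/p)\,\lambda_{ii}]^2} - 3$
      $= 3\left[\dfrac{p\,\mathcal{E}R^4}{(p+2)(\mathcal{E}R^2)^2} - 1\right]$
      $= 3\kappa$,

say. This is known as the kurtosis. (Note that $\kappa$ is
$\frac{1}{3}\{\mathcal{E}(X_i - \mu_i)^4/[\mathcal{E}(X_i - \mu_i)^2]^2\} - 1$.)
The standardized fourth cumulant is $3\kappa$ for every component of $X$. The
fourth cumulant of $X_i$, $X_j$, $X_k$, and $X_l$ is

(37)  $\kappa_{ijkl} = \mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) - (\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$
      $= \kappa\,(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$.

For the normal distribution $\kappa = 0$. The fourth-order moments can be written

(38)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) = (1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$.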

More detail about elliptically contoured distributions can be found in Fang 
and Zhang (1990). 




The class of elliptically contoured distributions generalizes the normal 
distribution, introducing more flexibility; the kurtosis is not required to be 0. 
The typical "bell-shaped surface" of $|\Lambda|^{-\frac{1}{2}}g[(x - \nu)'\Lambda^{-1}(x - \nu)]$ can be
more or less peaked than in the case of the normal distribution. In the next 
subsection some examples are given. 


2.7.5. Examples 


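(1) The multivariate t-distribution. Suppose $Z \sim N(0, I_p)$,
$ms^2 \stackrel{d}{=} \chi_m^2$, and $Z$ and $s^2$ are independent. Define
$Y = (1/s)Z$. Then the density of $Y$ is

(39)  $\dfrac{\Gamma\!\left(\frac{m+p}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)m^{\frac{1}{2}p}\,\pi^{\frac{1}{2}p}}\left(1 + \dfrac{y'y}{m}\right)^{-\frac{m+p}{2}}$,

and

(40)  $\dfrac{\mathcal{E}R^2}{p} = \dfrac{m}{m-2}$  (for $m > 2$).

If $X = \mu + CY$, the density of $X$ is

(41)  $\dfrac{\Gamma\!\left(\frac{m+p}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)m^{\frac{1}{2}p}\,\pi^{\frac{1}{2}p}}\;|\Lambda|^{-\frac{1}{2}}\left[1 + \dfrac{(x - \mu)'\Lambda^{-1}(x - \mu)}{m}\right]^{-\frac{m+p}{2}}$.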


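(2) Contaminated normal. The contaminated normal distribution is a mixture
of two normal distributions with proportional covariance matrices and the
same mean vector. The density can be written

(42)  $(1 - \varepsilon)\,\dfrac{1}{(2\pi)^{\frac{1}{2}p}\,|\Lambda|^{\frac{1}{2}}}\;e^{-\frac{1}{2}(x - \mu)'\Lambda^{-1}(x - \mu)} + \varepsilon\,\dfrac{1}{(2\pi)^{\frac{1}{2}p}\,|c\Lambda|^{\frac{1}{2}}}\;e^{-\frac{1}{2c}(x - \mu)'\Lambda^{-1}(x - \mu)}$,

where $c > 0$ and $0 \le \varepsilon \le 1$. Usually $\varepsilon$ is rather
small and $c$ rather large.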
(3) Mixtures of normal distributions. Let $w(v)$ be a cumulative distribution
function over $0 \le v \le \infty$. Then a mixture of normal densities is
defined by

(43)  $\int_0^{\infty} n(x \mid 0, v^2\Sigma)\,dw(v)$,

which is an elliptically contoured density. The random vector $X$ with this
density has a representation $X = wZ$, where $Z \sim N(0, \Sigma)$ and
$w \sim w(w)$ are independent.

Fang, Kotz, and Ng (1990) have discussed (43) and have given other 
examples of elliptically contoured distributions. 
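The kurtosis parameter $\kappa$ of (36) can be evaluated in closed form for the first two examples from the moments of $R$. The sketch below uses exact rational arithmetic from the Python standard library; the function names are hypothetical, and the moment formulas in the comments are standard chi-square facts supplied here rather than taken from the text (for the $t$ case they require $m > 4$).

```python
from fractions import Fraction

def kappa_from_r_moments(p, er2, er4):
    """kappa = p E R^4 / [(p + 2)(E R^2)^2] - 1, as in (36)."""
    return Fraction(p) * er4 / ((p + 2) * er2**2) - 1

def t_r_moments(p, m):
    """E R^2 and E R^4 for the multivariate t with m degrees of freedom.

    Here R^2 = Z'Z / s^2 with Z'Z ~ chi^2_p and m s^2 ~ chi^2_m independent,
    so E R^2 = p m/(m-2) and E R^4 = p(p+2) m^2 / [(m-2)(m-4)]; needs m > 4.
    """
    er2 = Fraction(p * m, m - 2)
    er4 = Fraction(p * (p + 2) * m * m, (m - 2) * (m - 4))
    return er2, er4

def contaminated_r_moments(p, eps, c):
    """E R^2 and E R^4 for the contaminated normal (42).

    In standardized coordinates R^2 is chi^2_p with probability 1 - eps
    and c * chi^2_p with probability eps.
    """
    eps, c = Fraction(eps), Fraction(c)
    er2 = p * ((1 - eps) + eps * c)
    er4 = p * (p + 2) * ((1 - eps) + eps * c * c)
    return er2, er4

kappa_t = kappa_from_r_moments(3, *t_r_moments(3, 10))
kappa_cn = kappa_from_r_moments(3, *contaminated_r_moments(3, Fraction(1, 10), 9))
kappa_normal = kappa_from_r_moments(3, *contaminated_r_moments(3, 0, 9))
print(kappa_t, kappa_cn, kappa_normal)
```

In both families $\kappa$ does not depend on $p$; for the $t$-distribution it reduces to $2/(m - 4)$, and for $\varepsilon = 0$ the contaminated normal gives the normal value $\kappa = 0$.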


PROBLEMS 


2.1. (Sec. 2.2) Let $f(x, y) = 1$, $0 \le x \le 1$, $0 \le y \le 1$,
     $= 0$, otherwise.
Find:
(a) $F(x, y)$.
(b) $F(x)$.
(c) $f(x)$.
(d) $f(x \mid y)$. [Note: $f(x_0 \mid y_0) = 0$ if $f(x_0, y_0) = 0$.]
(e) $\mathcal{E}X^nY^m$.
(f) Prove $X$ and $Y$ are independent.


2.2. (Sec. 2.2) Let $f(x, y) = 2$, $0 \le y \le x \le 1$,
     $= 0$, otherwise.
Find:
(a) $F(x, y)$.
(b) $F(x)$.
(c) $f(x)$.
(d) $G(y)$.
(e) $g(y)$.
(f) $f(x \mid y)$.
(g) $f(y \mid x)$.
(h) $\mathcal{E}X^nY^m$.
(i) Are $X$ and $Y$ independent?


2.3. 


2.4. 


2.5. 


(бес. 22) Let (х,у) = С for х? +у? <k? and 0 elsewhere. Prove C= 
l/Grk?) &X = 6У=0, &X?-— £Y?—k?/4, and @XY=0. Are X and Y 
indeperdent? 


(Sec. 22) Let F(x,x;) be the joint cdf of Ху, Xz, and let Еҳх,) be the 
marginal cdf of X; i — 1,2. Prove that if F(xj) is continuous, i= 1,2, then 
F(x, x3) is continuous. у 


(Scc. 2.2) Show that if the set JX,..., X, is independent of the set 
Х,ал Xp then | 


8g (X, X AX ases Xp) = @8(ХЬ.. XM Kear yee Хр). 


Е 6. 


2.7. 


2.11. 
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(бес. 2.3) Sketch the ellipses f(x, у) = 0.06, where f(x,y) is the bivariate 
normal density with 


(a) u,71, ну = 2, oj = 1, gy = 1, pry = 0 
(b) à, = 0, ру = 0, o? =1, gj = 1, Pry = 0 
(©) р, = 0, 4,70, 02 = 1, o = 1, р.у = 022. 
(d) џ.= 0, ру =0, oj = 1, о> = 1, Pry =.0.8. 
(е) р, = 0, м, = 0, 02 = 4, 0) = 1, Py = 08. 


(Sec. 2.3) Find b and А so that the following densities can be written іп the 
form of (23). Also find Hy, Hys Ors Fy and Pyy- 


(a) epl- На - D? (у - FD. 


(b) 


1 x1/4—- 1.63y/2 +y? 
gaze 77 072 ^ } 


(c) zorni- а? +y? +4х — 6у + 13)]. 


(d) zel- 10242 +y? + 2ay – 22x — 14у + 65)]. 


. (бес. 2.3) For each matrix A in Problem 2.7 find С so that C'AC =I. 


. (Sec. 2.3) Let b = 0, 


(a) Write the density (23). 
(b) Find $. 


(Sec. 2.3) Prove that the principal axes of (55) of Section 2.3 are along the 45° 
and 135° lines with lengths 2үс(1+ р) and 2yc(1- p), respectively, by 
transforming according to y, = (21 + 22)/ 32,57 G, 72)/ 2. 


(Sec. 2.3) Suppose the scalar random variables X;,..., X, are ‘independent 
and have a density which is a function only of x? + + +х2. Prove that the X; 


. are normally distributed with mean 0 and common variance. Indicate the 


mildest conditions on the density for your proof. 
2.12. (Sec. 2.3) Show that if $\Pr\{X \ge 0, Y \ge 0\} = \alpha$ for the distribution

$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right],$

then $\rho = \cos(1 - 2\alpha)\pi$. [Hint: Let $X = U$, $Y = \rho U + \sqrt{1-\rho^2}\,V$ and verify $\rho = \cos 2\pi(\tfrac12 - \alpha)$ geometrically.]


2.13. (Sec. 2.3) Prove that if $\rho_{ij} = \rho$, $i \ne j$, $i, j = 1, \dots, p$, then $\rho \ge -1/(p-1)$.


2.14. (Sec. 2.3) Concentration ellipsoid. Let the density of the p-component $Y$ be $f(y) = \Gamma(\tfrac12 p + 1)/[(p+2)\pi]^{p/2}$ for $y'y \le p + 2$ and 0 elsewhere. Then $\mathscr{E}Y = 0$ and $\mathscr{E}YY' = I$ (Problem 7.4). From this result prove that if the density of $X$ is $g(x) = \sqrt{|A|}\,\Gamma(\tfrac12 p + 1)/[(p+2)\pi]^{p/2}$ for $(x-\mu)'A(x-\mu) \le p + 2$ and 0 elsewhere, then $\mathscr{E}X = \mu$ and $\mathscr{E}(X-\mu)(X-\mu)' = A^{-1}$.


2.15. (Sec. 2.4) Show that when X is normally distributed the components are mutually independent if and only if the covariance matrix is diagonal.


2.16. (Sec. 2.4) Find necessary and sufficient conditions on $A$ so that $AY + \lambda$ has a continuous cdf.


2.17. (Sec. 2.4) Which densities in Problem 2.7 define distributions in which X and Y are independent?


2.18. (Sec. 2.4)

(a) Write the marginal density of X for each case in Problem 2.6.

(b) Indicate the marginal distribution of X for each case in Problem 2.7 by the notation N(a, b).

(c) Write the marginal density of $X_1$ and $X_3$ in Problem 2.9.


2.19. (Sec. 2.4) What is the distribution of Z = X − Y when X and Y have each of the densities in Problem 2.6?


2.20. (Sec. 2.4) What is the distribution of $X_1 + 2X_2 - 3X_3$ when $X_1, X_2, X_3$ have the distribution defined in Problem 2.9?


2.21. (Sec. 2.4) Let $\mathbf{X} = (X_1, X_2)'$, where $X_1 = X$ and $X_2 = aX + b$ and $X$ has the distribution N(0, 1). Find the cdf of $\mathbf{X}$.


2.22. (Sec. 2.4) Let $X_1, \dots, X_N$ be independently distributed, each according to $N(\mu, \sigma^2)$.

(a) What is the distribution of $\mathbf{X} = (X_1, \dots, X_N)'$? Find the vector of means and the covariance matrix.

(b) Using Theorem 2.4.4, find the marginal distribution of $\bar{X} = \sum X_i/N$.
2.23. (Sec. 2.4) Let $X_1, \dots, X_N$ be independently distributed with $X_i$ having distribution $N(\beta + \gamma z_i, \sigma^2)$, where $z_i$ is a given number, $i = 1, \dots, N$, and $\sum_i z_i = 0$.

(a) Find the distribution of $(X_1, \dots, X_N)'$.
(b) Find the distribution of $\bar{X}$ and $g = \sum X_i z_i / \sum z_i^2$ for $\sum z_i^2 > 0$.


2.24. (Sec. 2.4) Let $(X_1, Y_1)'$, $(X_2, Y_2)'$, $(X_3, Y_3)'$ be independently distributed, $(X_i, Y_i)'$ according to


$i = 1, 2, 3$.


(a) Find the distribution of the six variables.
(b) Find the distribution of $(\bar{X}, \bar{Y})'$.


2.25. (Sec. 2.4) Let X have a (singular) normal distribution with mean 0 and 


covariance matrix 


x 1) 


(a) Prove $\Sigma$ is of rank 1.


(b) Find $a$ so $X = a'Y$ and $Y$ has a nonsingular normal distribution, and give the density of $Y$.


2.26. (Sec. 2.4) Let 


(a) Find a vector $u \ne 0$ so that $\Sigma u = 0$. [Hint: Take cofactors of any column.]


(b) Show that any matrix of the form $G = (H \;\; u)$, where $H$ is $3 \times 2$, has the property

$G'\Sigma G = \begin{pmatrix} H'\Sigma H & 0 \\ 0 & 0 \end{pmatrix}.$

(c) Using (a) and (b), find $B$ to satisfy (36).
(d) Find $B^{-1}$ and partition according to (39).
(e) Verify that $CC' = \Sigma$.


2.27. (Sec. 2.4) Prove that if the joint (marginal) distribution of $X_1$ and $X_2$ is singular (that is, degenerate), then the joint distribution of $X_1$, $X_2$, and $X_3$ is singular.
2.28. (Sec. 2.5) In each part of Problem 2.6, find the conditional distribution of X given Y = y, find the conditional distribution of Y given X = x, and plot each regression line on the appropriate graph in Problem 2.6.


2.29. (Sec. 2.5) Let $\mu = 0$ and

$\Sigma = \begin{pmatrix} 1 & 0.80 & -0.40 \\ 0.80 & 1 & -0.56 \\ -0.40 & -0.56 & 1 \end{pmatrix}.$

(a) Find the conditional distribution of $X_1$ and $X_3$, given $X_2 = x_2$.

(b) What is the partial correlation between $X_1$ and $X_3$ given $X_2$?


2.30. (Sec. 2.5) In Problem 2.9, find the conditional distribution of $X_1$ and $X_2$ given $X_3 = x_3$.


2.31. (Sec. 2.5) Verify (20) directly from Theorem 2.5.1.


2.32. (Sec. 2.5)

(a) Show that finding $\alpha$ to maximize the absolute value of the correlation between $X_i$ and $\alpha'X^{(2)}$ is equivalent to maximizing $(\sigma_{(i)}'\alpha)^2$ subject to $\alpha'\Sigma_{22}\alpha$ constant.

(b) Find $\alpha$ by maximizing $(\sigma_{(i)}'\alpha)^2 - \lambda(\alpha'\Sigma_{22}\alpha - c)$, where $c$ is a constant and $\lambda$ is a Lagrange multiplier.


2.33. (Sec. 2.5) Invariance of the multiple correlation coefficient. Prove that $\bar{R}_{i\cdot q+1,\dots,p}$ is an invariant characteristic of the multivariate normal distribution of $X_i$ and $X^{(2)}$ under the transformation $x_i^* = b_i x_i + c_i$ for $b_i \ne 0$ and $X^{(2)*} = HX^{(2)} + k$ for $H$ nonsingular and that every function of $\mu_i$, $\sigma_{ii}$, $\sigma_{(i)}$, $\mu^{(2)}$, and $\Sigma_{22}$ that is invariant is a function of $\bar{R}_{i\cdot q+1,\dots,p}$.


2.34. (Sec. 2.5) Prove that


2.35. (Sec. 2.5) Find the multiple correlation coefficient between $X_1$ and $(X_2, X_3)$ in Problem 2.29.


2.36. (Sec. 2.5) Prove explicitly that if $\Sigma$ is positive definite,

$|\Sigma| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}| \cdot |\Sigma_{22}|.$

[Hint: Use (19) and (27) successively.]


2.37. (Sec. 2.5) Prove Hadamard's inequality

$|\Sigma| \le \prod_{i=1}^{p} \sigma_{ii}.$

[Hint: Using Problem 2.36, prove $|\Sigma| \le \sigma_{11}|\Sigma_{22}|$, where $\Sigma_{22}$ is $(p-1) \times (p-1)$, and apply induction.]


2.38. (Sec. 2.5) Prove equality holds in Problem 2.37 if and only if $\Sigma$ is diagonal.


2.39. (Sec. 2.5) Prove $\beta_{12\cdot 3} = \sigma_{12\cdot 3}/\sigma_{22\cdot 3} = \rho_{12\cdot 3}\,\sigma_{1\cdot 3}/\sigma_{2\cdot 3}$ and $\beta_{13\cdot 2} = \sigma_{13\cdot 2}/\sigma_{33\cdot 2} = \rho_{13\cdot 2}\,\sigma_{1\cdot 2}/\sigma_{3\cdot 2}$, where $\sigma_{i\cdot k}^2 = \sigma_{ii\cdot k}$.


2.40. (Sec. 2.5) Let $(X_1, X_2)$ have the density $n(x \mid 0, \Sigma) = f(x_1, x_2)$. Let the density of $X_2$ given $X_1 = x_1$ be $f(x_2 \mid x_1)$. Let the joint density of $X_1, X_2, X_3$ be $f(x_1, x_2) f(x_3 \mid x_2)$. Find the covariance matrix of $X_1, X_2, X_3$ and the partial correlation between $X_1$ and $X_3$ for given $X_2$.


2.41. (Sec. 2.5) Prove $1 - \bar{R}^2_{1\cdot 23} = (1 - \rho_{13}^2)(1 - \rho_{12\cdot 3}^2)$. [Hint: Use the fact that the variance of $X_1$ in the conditional distribution given $x_2$ and $x_3$ is $(1 - \bar{R}^2_{1\cdot 23})\sigma_{11}$.]


2.42. (Sec. 2.5) If p = 2, can there be a difference between the simple correlation between $X_1$ and $X_2$ and the multiple correlation between $X_1$ and $X^{(2)} = X_2$? Explain.


2.43. (Sec. 2.5) Prove

$\beta_{ik\cdot q+1,\dots,k-1,k+1,\dots,p} = \dfrac{\sigma_{ik\cdot q+1,\dots,k-1,k+1,\dots,p}}{\sigma_{kk\cdot q+1,\dots,k-1,k+1,\dots,p}} = \rho_{ik\cdot q+1,\dots,k-1,k+1,\dots,p}\,\dfrac{\sigma_{i\cdot q+1,\dots,k-1,k+1,\dots,p}}{\sigma_{k\cdot q+1,\dots,k-1,k+1,\dots,p}},$

$i = 1, \dots, q$, $k = q+1, \dots, p$, where $\sigma^2_{j\cdot q+1,\dots,k-1,k+1,\dots,p} = \sigma_{jj\cdot q+1,\dots,k-1,k+1,\dots,p}$, $j = i, k$. [Hint: Prove this for the special case $k = q + 1$ by using Problem 2.56 with $p_1 = q$, $p_2 = 1$, $p_3 = p - q - 1$.]


2.44. (Sec. 2.5) Give a necessary and sufficient condition for $\bar{R}_{i\cdot q+1,\dots,p} = 0$ in terms of $\sigma_{i,q+1}, \dots, \sigma_{ip}$.


2.45. (Sec. 2.5) Show
2.46. (Sec. 2.5) Show

$\rho^2_{ij\cdot q+1,\dots,p} = \beta_{ij\cdot q+1,\dots,p}\,\beta_{ji\cdot q+1,\dots,p}.$


2.47. (Sec. 2.5) Prove


[Hint: Apply Theorem A.3.2 of the Appendix to the cofactors used to calculate $\sigma^{ij}$.]


2.48. (Sec. 2.5) Show that for any joint distribution for which the expectations exist and any function $h(x^{(2)})$ that

$\mathscr{E}[X_i - \mathscr{E}(X_i \mid X^{(2)})]\,h(X^{(2)}) = 0.$

[Hint: In the above take the expectation first with respect to $X_i$ conditional on $X^{(2)}$.]


2.49. (Sec. 2.5) Show that for any function $h(x^{(2)})$ and any joint distribution of $X_i$ and $X^{(2)}$ for which the relevant expectations exist, $\mathscr{E}[X_i - h(X^{(2)})]^2 = \mathscr{E}[X_i - g(X^{(2)})]^2 + \mathscr{E}[g(X^{(2)}) - h(X^{(2)})]^2$, where $g(x^{(2)}) = \mathscr{E}(X_i \mid x^{(2)})$ is the conditional expectation of $X_i$ given $X^{(2)} = x^{(2)}$. Hence $g(X^{(2)})$ minimizes the mean squared error of prediction. [Hint: Use Problem 2.48.]


2.50. (Sec. 2.5) Show that for any function $h(x^{(2)})$ and any joint distribution of $X_i$ and $X^{(2)}$ for which the relevant expectations exist, the correlation between $X_i$ and $h(X^{(2)})$ is not greater than the correlation between $X_i$ and $g(X^{(2)})$, where $g(x^{(2)}) = \mathscr{E}(X_i \mid x^{(2)})$.


2.51. (Sec. 2.5) Show that for any vector function $h(x^{(2)})$

$\mathscr{E}[X^{(1)} - h(X^{(2)})][X^{(1)} - h(X^{(2)})]' - \mathscr{E}[X^{(1)} - \mathscr{E}(X^{(1)} \mid X^{(2)})][X^{(1)} - \mathscr{E}(X^{(1)} \mid X^{(2)})]'$

is positive semidefinite. Note this generalizes Theorem 2.5.3 and Problem 2.49.


2.52. (Sec. 2.5) Verify that $\Sigma_{12}\Sigma_{22}^{-1} = -\Psi_{11}^{-1}\Psi_{12}$, where $\Psi = \Sigma^{-1}$ is partitioned similarly to $\Sigma$.


2.53. (Sec. 2.5) Show

$\Sigma^{-1} = \begin{pmatrix} \Sigma_{11\cdot 2}^{-1} & -\Sigma_{11\cdot 2}^{-1}\beta \\ -\beta'\Sigma_{11\cdot 2}^{-1} & \beta'\Sigma_{11\cdot 2}^{-1}\beta + \Sigma_{22}^{-1} \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} + \begin{pmatrix} I \\ -\beta' \end{pmatrix} \Sigma_{11\cdot 2}^{-1} \begin{pmatrix} I & -\beta \end{pmatrix},$

where $\beta = \Sigma_{12}\Sigma_{22}^{-1}$. [Hint: Use Theorem A.3.3 of the Appendix and the fact that $\Sigma^{-1}$ is symmetric.]
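2.54. (Sec. 2.5) Use Problem 2.53 to show that

$x'\Sigma^{-1}x = (x^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}x^{(2)})'\,\Sigma_{11\cdot 2}^{-1}\,(x^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}x^{(2)}) + x^{(2)\prime}\Sigma_{22}^{-1}x^{(2)}.$


2.55. (Sec. 2.5) Show

$\mathscr{E}(X^{(1)} \mid x^{(2)}, x^{(3)}) = \mu^{(1)} + \Sigma_{13}\Sigma_{33}^{-1}(x^{(3)} - \mu^{(3)}) + (\Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32})(\Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32})^{-1}\,[x^{(2)} - \mu^{(2)} - \Sigma_{23}\Sigma_{33}^{-1}(x^{(3)} - \mu^{(3)})].$


2.56. (Sec. 2.5) Prove by matrix algebra that

$\Sigma_{11} - (\Sigma_{12} \;\; \Sigma_{13}) \begin{pmatrix} \Sigma_{22} & \Sigma_{23} \\ \Sigma_{32} & \Sigma_{33} \end{pmatrix}^{-1} \begin{pmatrix} \Sigma_{21} \\ \Sigma_{31} \end{pmatrix} = \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31} - (\Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32})(\Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32})^{-1}(\Sigma_{21} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{31}).$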


2.57. (Sec. 2.5) Invariance of the partial correlation coefficient. Prove that ру». is 


TEM p 


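2.58. (Sec. 2.5) Suppose $X^{(1)}$ and $X^{(2)}$ of $q$ and $p - q$ components, respectively, have the density

$\dfrac{|A|^{1/2}}{(2\pi)^{p/2}}\, e^{-\frac12 Q},$

where

$Q = (x^{(1)} - \mu^{(1)})'A_{11}(x^{(1)} - \mu^{(1)}) + (x^{(1)} - \mu^{(1)})'A_{12}(x^{(2)} - \mu^{(2)}) + (x^{(2)} - \mu^{(2)})'A_{21}(x^{(1)} - \mu^{(1)}) + (x^{(2)} - \mu^{(2)})'A_{22}(x^{(2)} - \mu^{(2)}).$

Show that $Q$ can be written as $Q_1 + Q_2$, where

$Q_1 = [(x^{(1)} - \mu^{(1)}) + A_{11}^{-1}A_{12}(x^{(2)} - \mu^{(2)})]'\,A_{11}\,[(x^{(1)} - \mu^{(1)}) + A_{11}^{-1}A_{12}(x^{(2)} - \mu^{(2)})],$

$Q_2 = (x^{(2)} - \mu^{(2)})'(A_{22} - A_{21}A_{11}^{-1}A_{12})(x^{(2)} - \mu^{(2)}).$

Show that the marginal density of $X^{(2)}$ is

$\dfrac{|A_{22} - A_{21}A_{11}^{-1}A_{12}|^{1/2}}{(2\pi)^{(p-q)/2}}\, e^{-\frac12 Q_2}.$

Show that the conditional density of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

$\dfrac{|A_{11}|^{1/2}}{(2\pi)^{q/2}}\, e^{-\frac12 Q_1}$

(without using the Appendix). This problem is meant to furnish an alternative proof of Theorems 2.4.3 and 2.5.1.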
2.59. (Sec. 2.6) Prove Lemma 2.6.2 in detail.


2.60. (Sec. 2.6) Let $Y$ be distributed according to $N(0, \Sigma)$. Differentiating the characteristic function, verify (25) and (26).


2.61. (Sec. 2.6) Verify (25) and (26) by using the transformation $X - \mu = CY$, where $\Sigma = CC'$, and integrating the density of $Y$.


2.62. (Sec. 2.6) Let the density of (X, Y) be

$2n(x \mid 0, 1)\,n(y \mid 0, 1)$ for $0 \le y \le x < \infty$, $\;0 \le -x \le y < \infty$, $\;0 \le -y \le -x < \infty$, $\;0 \le x \le -y < \infty$,

0 otherwise.

Show that X, Y, X + Y, X − Y each have a marginal normal distribution.


2.63. (Sec. 2.6) Suppose $X$ is distributed according to $N(0, \Sigma)$. Let $\Sigma = (\sigma_1, \dots, \sigma_p)$. Prove

$\mathscr{E}(XX' \otimes XX') = \Sigma \otimes \Sigma + \operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)' + \begin{pmatrix} \sigma_1\sigma_1' & \cdots & \sigma_p\sigma_1' \\ \vdots & & \vdots \\ \sigma_1\sigma_p' & \cdots & \sigma_p\sigma_p' \end{pmatrix} = (I + K)(\Sigma \otimes \Sigma) + \operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)',$

where

$\operatorname{vec}\Sigma = \begin{pmatrix} \sigma_1 \\ \vdots \\ \sigma_p \end{pmatrix}, \qquad K = \begin{pmatrix} \varepsilon_1\varepsilon_1' & \cdots & \varepsilon_p\varepsilon_1' \\ \vdots & & \vdots \\ \varepsilon_1\varepsilon_p' & \cdots & \varepsilon_p\varepsilon_p' \end{pmatrix},$

and $\varepsilon_i$ is a column vector with 1 in the ith position and 0's elsewhere.


2.64. Complex normal distribution. Let $(X', Y')'$ have a normal distribution with mean vector $(\mu_X', \mu_Y')'$ and covariance matrix

$\Sigma = \begin{pmatrix} \Gamma & -\Phi \\ \Phi & \Gamma \end{pmatrix},$

where $\Gamma$ is positive definite and $\Phi = -\Phi'$ (skew symmetric). Then $Z = X + iY$ is said to have a complex normal distribution with mean $\theta = \mu_X + i\mu_Y$ and covariance matrix $\mathscr{E}(Z - \theta)(Z - \theta)^* = P = Q + iR$, where $Z^* = X' - iY'$. Note that $P$ is Hermitian and positive definite.

(a) Show $Q = 2\Gamma$ and $R = 2\Phi$.

(b) Show $|P|^2 = 2^{2p}|\Sigma|$. [Hint: $|\Gamma + i\Phi| = |\Gamma - i\Phi|$.]

(c) Show

$P^{-1} = (Q + RQ^{-1}R)^{-1} - iQ^{-1}R(Q + RQ^{-1}R)^{-1}.$

Note that the inverse of a Hermitian matrix is Hermitian.

(d) Show that the density of $X$ and $Y$ can be written

$\pi^{-p}|P|^{-1} e^{-(z-\theta)^* P^{-1}(z-\theta)}.$


2.65. Complex normal (continued). If $Z$ has the complex normal distribution of Problem 2.64, show that $W = AZ$, where $A$ is a nonsingular complex matrix, has the complex normal distribution with mean $A\theta$ and covariance matrix $APA^*$.


2.66. Show that the characteristic function of $Z$ defined in Problem 2.64 is

$\mathscr{E}e^{i\mathscr{R}(u^*Z)} = e^{i\mathscr{R}(u^*\theta) - \frac14 u^*Pu},$

where $\mathscr{R}(x + iy) = x$.


2.67. (Sec. 2.2) Show that $\int_{-a}^{a} e^{-x^2/2}\,dx/\sqrt{2\pi}$ is approximately $(1 - e^{-2a^2/\pi})^{1/2}$. [Hint: The probability that (X, Y) falls in a square is approximately the probability that (X, Y) falls in an approximating circle [Pólya (1949)].]


2.68. (Sec. 2.7) For the multivariate t-distribution with density (41) show that $\mathscr{E}X = \mu$ and $\mathscr{C}(X) = [m/(m-2)]A^{-1}$.


CHAPTER 3 


Estimation of the Mean Vector 
and the Covariance Matrix 


3.1. INTRODUCTION 


The multivariate normal distribution is specified completely by the mean vector $\mu$ and the covariance matrix $\Sigma$. The first statistical problem is how to estimate these parameters on the basis of a sample of observations. In Section 3.2 it is shown that the maximum likelihood estimator of $\mu$ is the sample mean; the maximum likelihood estimator of $\Sigma$ is proportional to the matrix of sample variances and covariances. A sample variance is a sum of squares of deviations of observations from the sample mean divided by one less than the number of observations in the sample; a sample covariance is similarly defined in terms of cross products. The sample covariance matrix is an unbiased estimator of $\Sigma$.

The distribution of the sample mean vector is given in Section 3.3, and it is shown how one can test the hypothesis that $\mu$ is a given vector when $\Sigma$ is known. The case of $\Sigma$ unknown will be treated in Chapter 5.

Some theoretical properties of the sample mean are given in Section 3.4, and the Bayes estimator of the population mean is derived for a normal a priori distribution. In Section 3.5 the James-Stein estimator is introduced; improvements over the sample mean for the mean squared error loss function are discussed.

In Section 3.6 estimators of the mean vector and covariance matrix of elliptically contoured distributions and the distributions of the estimators are treated.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
3.2. THE MAXIMUM LIKELIHOOD ESTIMATORS OF THE MEAN
VECTOR AND THE COVARIANCE MATRIX


Given a sample of (vector) observations from a p-variate (nondegenerate) normal distribution, we ask for estimators of the mean vector $\mu$ and the covariance matrix $\Sigma$ of the distribution. We shall deduce the maximum likelihood estimators.

It turns out that the method of maximum likelihood is very useful in various estimation and hypothesis testing problems concerning the multivariate normal distribution. The maximum likelihood estimators or modifications of them often have some optimum properties. In the particular case studied here, the estimators are asymptotically efficient [Cramér (1946), Sec. 33.3].

Suppose our sample of N observations on X distributed according to $N(\mu, \Sigma)$ is $x_1, \dots, x_N$, where $N > p$. The likelihood function is

(1)   $L = \prod_{\alpha=1}^{N} n(x_\alpha \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac12 pN}|\Sigma|^{\frac12 N}} \exp\left[-\tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right].$

In the likelihood function the vectors $x_1, \dots, x_N$ are fixed at the sample values and L is a function of $\mu$ and $\Sigma$. To emphasize that these quantities are variables (and not parameters) we shall denote them by $\mu^*$ and $\Sigma^*$. Then the logarithm of the likelihood function is

(2)   $\log L = -\tfrac12 pN \log 2\pi - \tfrac12 N \log|\Sigma^*| - \tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*).$

Since log L is an increasing function of L, its maximum is at the same point in the space of $\mu^*$, $\Sigma^*$ as the maximum of L. The maximum likelihood estimators of $\mu$ and $\Sigma$ are the vector $\mu^*$ and the positive definite matrix $\Sigma^*$ that maximize log L. (It remains to be seen that the supremum of log L is attained for a positive definite matrix $\Sigma^*$.)

Let the sample mean vector be

(3)   $\bar{x} = \dfrac{1}{N}\sum_{\alpha=1}^{N} x_\alpha = \begin{pmatrix} \frac{1}{N}\sum_{\alpha=1}^{N} x_{1\alpha} \\ \vdots \\ \frac{1}{N}\sum_{\alpha=1}^{N} x_{p\alpha} \end{pmatrix} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{pmatrix},$

where $x_\alpha = (x_{1\alpha}, \dots, x_{p\alpha})'$ and $\bar{x}_i = \sum_{\alpha=1}^{N} x_{i\alpha}/N$, and let the matrix of sums of squares and cross products of deviations about the mean be

(4)   $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \left[\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)\right], \qquad i, j = 1, \dots, p.$

It will be convenient to use the following lemma: 


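Lemma 3.2.1. Let $x_1, \dots, x_N$ be N (p-component) vectors, and let $\bar{x}$ be defined by (3). Then for any vector b

(5)   $\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N(\bar{x} - b)(\bar{x} - b)'.$


Proof.

(6)   $\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x}) + (\bar{x} - b)][(x_\alpha - \bar{x}) + (\bar{x} - b)]'$

      $= \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x})(x_\alpha - \bar{x})' + (x_\alpha - \bar{x})(\bar{x} - b)' + (\bar{x} - b)(x_\alpha - \bar{x})' + (\bar{x} - b)(\bar{x} - b)']$

      $= \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + \left[\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})\right](\bar{x} - b)' + (\bar{x} - b)\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})' + N(\bar{x} - b)(\bar{x} - b)'.$

The second and third terms on the right-hand side are 0 because $\sum_\alpha (x_\alpha - \bar{x}) = \sum_\alpha x_\alpha - N\bar{x} = 0$ by (3).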
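
The identity (5) is easy to check numerically. The following is a minimal sketch in plain Python; the four data vectors and the vector b are made-up illustrative values, and the helper names (`outer`, `sum_outer_about`, and so on) are hypothetical, not taken from the text.

```python
# Numerical check of identity (5) on illustrative (hypothetical) data.

def outer(u, v):
    """Outer product u v' as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(m1, m2):
    """Elementwise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def mat_scale(c, m):
    """Scalar multiple c * m."""
    return [[c * a for a in row] for row in m]

def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]

def sum_outer_about(xs, center):
    """Sum over alpha of (x_alpha - center)(x_alpha - center)'."""
    p = len(center)
    total = [[0.0] * p for _ in range(p)]
    for x in xs:
        d = vec_sub(x, center)
        total = mat_add(total, outer(d, d))
    return total

xs = [[1.0, 2.0], [3.0, 0.5], [-1.0, 4.0], [2.0, 2.5]]   # N = 4, p = 2
N = len(xs)
xbar = [sum(x[i] for x in xs) / N for i in range(2)]
b = [0.3, -1.2]                                           # arbitrary vector b

lhs = sum_outer_about(xs, b)                              # left side of (5)
d = vec_sub(xbar, b)
rhs = mat_add(sum_outer_about(xs, xbar), mat_scale(N, outer(d, d)))

# Largest absolute difference between the two sides; zero up to rounding.
max_gap = max(abs(l - r) for rl, rr in zip(lhs, rhs) for l, r in zip(rl, rr))
print(max_gap)
```

Setting b = 0 in the same sketch reproduces the computing formula obtained later by specializing the lemma.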


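When we let $b = \mu^*$, we have

(7)   $\sum_{\alpha=1}^{N} (x_\alpha - \mu^*)(x_\alpha - \mu^*)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N(\bar{x} - \mu^*)(\bar{x} - \mu^*)' = A + N(\bar{x} - \mu^*)(\bar{x} - \mu^*)'.$

Using this result and the properties of the trace of a matrix ($\operatorname{tr} CD = \sum_{i,j} c_{ij}d_{ji} = \operatorname{tr} DC$), we have

(8)   $\sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*) = \operatorname{tr} \sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*)$

      $= \operatorname{tr} \sum_{\alpha=1}^{N} \Sigma^{*-1}(x_\alpha - \mu^*)(x_\alpha - \mu^*)'$

      $= \operatorname{tr} \Sigma^{*-1}A + \operatorname{tr} \Sigma^{*-1}N(\bar{x} - \mu^*)(\bar{x} - \mu^*)'$

      $= \operatorname{tr} \Sigma^{*-1}A + N(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*).$

Thus we can write (2) as

(9)   $\log L = -\tfrac12 pN \log(2\pi) - \tfrac12 N \log|\Sigma^*| - \tfrac12 \operatorname{tr} \Sigma^{*-1}A - \tfrac12 N(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*).$

Since $\Sigma^*$ is positive definite, $\Sigma^{*-1}$ is positive definite, and $N(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*) \ge 0$ and is 0 if and only if $\mu^* = \bar{x}$. To maximize the second and third terms of (9) we use the following lemma (which is also used in later chapters):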


Lemma 3.2.2. If D is positive definite of order p, the maximum of

(10)   $f(G) = -N \log|G| - \operatorname{tr} G^{-1}D$

with respect to positive definite matrices G exists, occurs at $G = (1/N)D$, and has the value

(11)   $f[(1/N)D] = pN \log N - N \log|D| - pN.$


Proof. Let $D = EE'$ and $E'G^{-1}E = H$. Then $G = EH^{-1}E'$, and $|G| = |E| \cdot |H^{-1}| \cdot |E'| = |H^{-1}| \cdot |EE'| = |D|/|H|$, and $\operatorname{tr} G^{-1}D = \operatorname{tr} G^{-1}EE' = \operatorname{tr} E'G^{-1}E = \operatorname{tr} H$. Then the function to be maximized (with respect to positive definite H) is

(12)   $f = -N \log|D| + N \log|H| - \operatorname{tr} H.$

Let $H = TT'$, where T is lower triangular (Corollary A.1.7). Then the maximum of

(13)   $f = -N \log|D| + N \log|T|^2 - \operatorname{tr} TT' = -N \log|D| + \sum_{i=1}^{p} (N \log t_{ii}^2 - t_{ii}^2) - \sum_{i>j} t_{ij}^2$

occurs at $t_{ii}^2 = N$, $t_{ij} = 0$, $i \ne j$; that is, at $H = NI$. Then $G = (1/N)EE' = (1/N)D$.
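
As an informal check of Lemma 3.2.2, the sketch below evaluates f(G) of (10) for a 2 × 2 case, compares the value at G = (1/N)D with the closed form (11), and confirms that a few nearby positive definite matrices give smaller values. The matrix D, the value N = 10, and the perturbations are arbitrary illustrative choices, and the helper functions are hypothetical.

```python
import math

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def trace_prod(a, b):
    """tr(ab) for 2 x 2 matrices."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))

def f(G, D, N):
    """f(G) = -N log|G| - tr(G^{-1} D), as in (10)."""
    return -N * math.log(det2(G)) - trace_prod(inv2(G), D)

D = [[4.0, 1.0], [1.0, 3.0]]    # positive definite (illustrative)
N, p = 10, 2

G_hat = [[d / N for d in row] for row in D]          # claimed maximizer (1/N) D
f_max = f(G_hat, D, N)
closed_form = p * N * math.log(N) - N * math.log(det2(D)) - p * N   # (11)

# A few positive definite perturbations of G_hat.
perturbed = [
    [[G_hat[0][0] + 0.05, G_hat[0][1]], [G_hat[1][0], G_hat[1][1]]],
    [[G_hat[0][0], G_hat[0][1] + 0.03], [G_hat[1][0] + 0.03, G_hat[1][1]]],
    [[1.5 * g for g in row] for row in G_hat],
]
values = [f(G, D, N) for G in perturbed]
print(f_max, closed_form, values)
```

A finite set of perturbations does not prove the maximum, of course; the proof above does that. The sketch only illustrates the statement.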


Theorem 3.2.1. If $x_1, \dots, x_N$ constitute a sample from $N(\mu, \Sigma)$ with $p < N$, the maximum likelihood estimators of $\mu$ and $\Sigma$ are $\hat{\mu} = \bar{x} = (1/N)\sum_{\alpha=1}^{N} x_\alpha$ and $\hat{\Sigma} = (1/N)\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, respectively.


Other methods of deriving the maximum likelihood estimators have been discussed by Anderson and Olkin (1985). See Problems 3.4, 3.8, and 3.12.

Computation of the estimate $\hat{\Sigma}$ is made easier by the specialization of Lemma 3.2.1 (b = 0)

(14)   $\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N\bar{x}\bar{x}'.$

An element of $\sum_{\alpha=1}^{N} x_\alpha x_\alpha'$ is computed as $\sum_{\alpha=1}^{N} x_{i\alpha}x_{j\alpha}$, and an element of $N\bar{x}\bar{x}'$ is computed as $N\bar{x}_i\bar{x}_j$ or $(\sum_{\alpha=1}^{N} x_{i\alpha})(\sum_{\alpha=1}^{N} x_{j\alpha})/N$. It should be noted that if $N > p$, the probability is 1 of drawing a sample so that (14) is positive definite; see Problem 3.17.

The covariance matrix can be written in terms of the variances or standard deviations and correlation coefficients. These are uniquely defined by the variances and covariances. We assert that the maximum likelihood estimators of functions of the parameters are those functions of the maximum likelihood estimators of the parameters.


Lemma 3.2.3. Let $f(\theta)$ be a real-valued function defined on a set S, and let $\phi$ be a single-valued function, with a single-valued inverse, on S to a set S*; that is, to each $\theta \in S$ there corresponds a unique $\theta^* \in S^*$, and, conversely, to each $\theta^* \in S^*$ there corresponds a unique $\theta \in S$. Let

(15)   $g(\theta^*) = f[\phi^{-1}(\theta^*)].$

Then if $f(\theta)$ attains a maximum at $\theta = \theta_0$, $g(\theta^*)$ attains a maximum at $\theta^* = \theta_0^* = \phi(\theta_0)$. If the maximum of $f(\theta)$ at $\theta_0$ is unique, so is the maximum of $g(\theta^*)$ at $\theta_0^*$.


Proof. By hypothesis $f(\theta_0) \ge f(\theta)$ for all $\theta \in S$. Then for any $\theta^* \in S^*$

(16)   $g(\theta^*) = f[\phi^{-1}(\theta^*)] = f(\theta) \le f(\theta_0) = g[\phi(\theta_0)] = g(\theta_0^*).$

Thus $g(\theta^*)$ attains a maximum at $\theta_0^*$. If the maximum of $f(\theta)$ at $\theta_0$ is unique, there is strict inequality above for $\theta \ne \theta_0$, and the maximum of $g(\theta^*)$ is unique.
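We have the following corollary:


Corollary 3.2.1. If on the basis of a given sample $\hat\theta_1, \dots, \hat\theta_m$ are maximum likelihood estimators of the parameters $\theta_1, \dots, \theta_m$ of a distribution, then $\phi_1(\hat\theta_1, \dots, \hat\theta_m), \dots, \phi_m(\hat\theta_1, \dots, \hat\theta_m)$ are maximum likelihood estimators of $\phi_1(\theta_1, \dots, \theta_m), \dots, \phi_m(\theta_1, \dots, \theta_m)$ if the transformation from $\theta_1, \dots, \theta_m$ to $\phi_1, \dots, \phi_m$ is one-to-one.† If the estimators of $\theta_1, \dots, \theta_m$ are unique, then the estimators of $\phi_1, \dots, \phi_m$ are unique.


Corollary 3.2.2. If $x_1, \dots, x_N$ constitutes a sample from $N(\mu, \Sigma)$, where $\sigma_{ij} = \sigma_i\sigma_j\rho_{ij}$ ($\rho_{ii} = 1$), then the maximum likelihood estimator of $\mu$ is $\hat\mu = \bar{x} = (1/N)\sum_\alpha x_\alpha$; the maximum likelihood estimator of $\sigma_i^2$ is $\hat\sigma_i^2 = (1/N)\sum_\alpha (x_{i\alpha} - \bar{x}_i)^2 = (1/N)(\sum_\alpha x_{i\alpha}^2 - N\bar{x}_i^2)$, where $x_{i\alpha}$ is the ith component of $x_\alpha$ and $\bar{x}_i$ is the ith component of $\bar{x}$; and the maximum likelihood estimator of $\rho_{ij}$ is

(17)   $\hat\rho_{ij} = \dfrac{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)^2}\sqrt{\sum_{\alpha=1}^{N} (x_{j\alpha} - \bar{x}_j)^2}} = \dfrac{\sum_{\alpha=1}^{N} x_{i\alpha}x_{j\alpha} - N\bar{x}_i\bar{x}_j}{\sqrt{\sum_{\alpha=1}^{N} x_{i\alpha}^2 - N\bar{x}_i^2}\sqrt{\sum_{\alpha=1}^{N} x_{j\alpha}^2 - N\bar{x}_j^2}}.$


Proof. The set of parameters $\mu_i = \mu_i$, $\sigma_i^2 = \sigma_{ii}$, and $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is a one-to-one transform of the set of parameters $\mu_i$ and $\sigma_{ij}$. Therefore, by Corollary 3.2.1 the estimator of $\mu_i$ is $\hat\mu_i$, of $\sigma_i^2$ is $\hat\sigma_{ii}$, and of $\rho_{ij}$ is

(18)   $\hat\rho_{ij} = \dfrac{\hat\sigma_{ij}}{\sqrt{\hat\sigma_{ii}\hat\sigma_{jj}}}.$


Pearson (1896) gave a justification for this estimator of $\rho_{ij}$, and (17) is sometimes called the Pearson correlation coefficient. It is also called the simple correlation coefficient. It is usually denoted by $r_{ij}$.


†The assumption that the transformation is one-to-one is made so that the set $\phi_1, \dots, \phi_m$ uniquely defines the likelihood. An alternative in case $\theta^* = \phi(\theta)$ does not have a unique inverse is to define $s(\theta^*) = \{\theta : \phi(\theta) = \theta^*\}$ and $g(\theta^*) = \sup\{f(\theta) \mid \theta \in s(\theta^*)\}$, which is considered the "induced likelihood" when $f(\theta)$ is the likelihood function. Then $\hat\theta^* = \phi(\hat\theta)$ maximizes $g(\theta^*)$, for $g(\theta^*) = \sup\{f(\theta) \mid \theta \in s(\theta^*)\} \le \sup\{f(\theta) \mid \theta \in S\} = f(\hat\theta) = g(\hat\theta^*)$ for all $\theta^* \in S^*$. [See, e.g., Zehna (1966).]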
A convenient geometrical interpretation of this sample $(x_1, x_2, \dots, x_N) = X$ is in terms of the rows of X. Let

(19)   $X = \begin{pmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & & \vdots \\ x_{p1} & \cdots & x_{pN} \end{pmatrix} = \begin{pmatrix} u_1' \\ \vdots \\ u_p' \end{pmatrix};$

that is, $u_i'$ is the ith row of X. The vector $u_i$ can be considered as a vector in an N-dimensional space with the $\alpha$th coordinate of one endpoint being $x_{i\alpha}$ and the other endpoint at the origin. Thus the sample is represented by p vectors in N-dimensional Euclidean space. By definition of the Euclidean metric, the squared length of $u_i$ (that is, the squared distance of one endpoint from the other) is $u_i'u_i = \sum_{\alpha=1}^{N} x_{i\alpha}^2$.

Now let us show that the cosine of the angle between $u_i$ and $u_j$ is $u_i'u_j/\sqrt{u_i'u_i\,u_j'u_j} = \sum_{\alpha=1}^{N} x_{i\alpha}x_{j\alpha}/\sqrt{\sum_{\alpha=1}^{N} x_{i\alpha}^2 \sum_{\alpha=1}^{N} x_{j\alpha}^2}$. Choose the scalar d so the vector $du_j$ is orthogonal to $u_i - du_j$; that is, $0 = du_j'(u_i - du_j) = d(u_j'u_i - du_j'u_j)$. Therefore, $d = u_j'u_i/u_j'u_j$. We decompose $u_i$ into $u_i - du_j$ and $du_j$ [$u_i = (u_i - du_j) + du_j$] as indicated in Figure 3.1. The absolute value of the cosine of the angle between $u_i$ and $u_j$ is the length of $du_j$ divided by the length of $u_i$; that is, it is $\sqrt{du_j'(du_j)/u_i'u_i} = \sqrt{du_j'u_jd/u_i'u_i}$; the cosine is $u_i'u_j/\sqrt{u_i'u_i\,u_j'u_j}$. This proves the desired result.

[Figure 3.1]

To give a geometric interpretation of $a_{ii}$ and $a_{ij}/\sqrt{a_{ii}a_{jj}}$, we introduce the equiangular line, which is the line going through the origin and the point (1, 1, ..., 1). See Figure 3.2. The projection of $u_i$ on the vector $\varepsilon = (1, 1, \dots, 1)'$ is $(\varepsilon'u_i/\varepsilon'\varepsilon)\varepsilon = (\sum_\alpha x_{i\alpha}/\sum_\alpha 1)\varepsilon = \bar{x}_i\varepsilon = (\bar{x}_i, \bar{x}_i, \dots, \bar{x}_i)'$. Then we decompose $u_i$ into $\bar{x}_i\varepsilon$, the projection on the equiangular line, and $u_i - \bar{x}_i\varepsilon$, the projection of $u_i$ on the plane perpendicular to the equiangular line. The squared length of $u_i - \bar{x}_i\varepsilon$ is $(u_i - \bar{x}_i\varepsilon)'(u_i - \bar{x}_i\varepsilon) = \sum_\alpha (x_{i\alpha} - \bar{x}_i)^2$; this is $N\hat\sigma_{ii} = a_{ii}$. Translate $u_i - \bar{x}_i\varepsilon$ and $u_j - \bar{x}_j\varepsilon$, so that each vector has an endpoint at the origin; the $\alpha$th coordinate of the first vector is $x_{i\alpha} - \bar{x}_i$, and of the second is $x_{j\alpha} - \bar{x}_j$. The cosine of the angle between these two vectors is

(20)   $\dfrac{(u_i - \bar{x}_i\varepsilon)'(u_j - \bar{x}_j\varepsilon)}{\sqrt{(u_i - \bar{x}_i\varepsilon)'(u_i - \bar{x}_i\varepsilon)\,(u_j - \bar{x}_j\varepsilon)'(u_j - \bar{x}_j\varepsilon)}} = \dfrac{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)^2 \sum_{\alpha=1}^{N} (x_{j\alpha} - \bar{x}_j)^2}}.$

[Figure 3.2]

As an example of the calculations consider the data in Table 3.1 and graphed in Figure 3.3, taken from Student (1908). The measurement $x_{11} = 1.9$ on the first patient is the increase in the number of hours of sleep due to the use of the sedative A, $x_{21} = 0.7$ is the increase in the number of hours due to sedative B, and so on. Assuming that each pair (i.e., each row in the table) is an observation from $N(\mu, \Sigma)$, we find that

(21)   $\hat\mu = \bar{x} = \begin{pmatrix} 2.33 \\ 0.75 \end{pmatrix}, \qquad \hat\Sigma = \begin{pmatrix} 3.61 & 2.56 \\ 2.56 & 2.88 \end{pmatrix}, \qquad S = \begin{pmatrix} 4.01 & 2.85 \\ 2.85 & 3.20 \end{pmatrix},$

and $\hat\rho_{12} = r_{12} = 0.7952$. (S will be defined later.)


Table 3.1. Increase in Sleep

Patient   Drug A   Drug B
   1        1.9      0.7
   2        0.8     -1.6
   3        1.1     -0.2
   4        0.1     -1.2
   5       -0.1     -0.1
   6        4.4      3.4
   7        5.5      3.7
   8        1.6      0.8
   9        4.6      0.0
  10        3.4      2.0


Figure 3.3. Increase in sleep.
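
The figures quoted for the sleep data can be reproduced with the computing formula (14). The sketch below is one way to do it in plain Python, using the ten pairs of increases in sleep from Student (1908); the variable names are illustrative only. It computes the sample mean, the maximum likelihood estimate (divisor N), the matrix with divisor N − 1 (the matrix S defined in Section 3.3), and the correlation coefficient (17).

```python
import math

# Increase in hours of sleep for ten patients (Student, 1908).
drug_a = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
drug_b = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
N = len(drug_a)

mean_a = sum(drug_a) / N
mean_b = sum(drug_b) / N

# Elements of A via (14): sums of squares and cross products minus N * xbar * xbar'.
a11 = sum(x * x for x in drug_a) - N * mean_a ** 2
a22 = sum(y * y for y in drug_b) - N * mean_b ** 2
a12 = sum(x * y for x, y in zip(drug_a, drug_b)) - N * mean_a * mean_b

sigma_hat = [[a11 / N, a12 / N], [a12 / N, a22 / N]]                     # divisor N
S = [[a11 / (N - 1), a12 / (N - 1)], [a12 / (N - 1), a22 / (N - 1)]]     # divisor N - 1
r12 = a12 / math.sqrt(a11 * a22)                                         # formula (17)

print(mean_a, mean_b)
print(sigma_hat)
print(S)
print(r12)
```

Note that the correlation coefficient is the same whether it is formed from A, from the estimate with divisor N, or from S, since the divisor cancels.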


3.3. THE DISTRIBUTION OF THE SAMPLE MEAN VECTOR; 
INFERENCE CONCERNING THE MEAN WHEN THE COVARIANCE 
MATRIX IS KNOWN 


3.3.1. Distribution Theory 


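In the univariate case the mean of a sample is distributed normally and independently of the sample variance. Similarly, the sample mean $\bar{X}$ defined in Section 3.2 is distributed normally and independently of $\hat\Sigma$.

To prove this result we shall make a transformation of the set of observation vectors. Because this kind of transformation is used several times in this book, we first prove a more general theorem.


Theorem 3.3.1. Suppose $X_1, \dots, X_N$ are independent, where $X_\alpha$ is distributed according to $N(\mu_\alpha, \Sigma)$. Let $C = (c_{\alpha\beta})$ be an $N \times N$ orthogonal matrix. Then $Y_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta}X_\beta$ is distributed according to $N(\nu_\alpha, \Sigma)$, where $\nu_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta}\mu_\beta$, $\alpha = 1, \dots, N$, and $Y_1, \dots, Y_N$ are independent.


Proof. The set of vectors $Y_1, \dots, Y_N$ have a joint normal distribution, because the entire set of components is a set of linear combinations of the components of $X_1, \dots, X_N$, which have a joint normal distribution. The expected value of $Y_\alpha$ is

(1)   $\mathscr{E}Y_\alpha = \mathscr{E}\sum_{\beta=1}^{N} c_{\alpha\beta}X_\beta = \sum_{\beta=1}^{N} c_{\alpha\beta}\mathscr{E}X_\beta = \sum_{\beta=1}^{N} c_{\alpha\beta}\mu_\beta = \nu_\alpha.$

The covariance matrix between $Y_\alpha$ and $Y_\gamma$ is

(2)   $\mathscr{C}(Y_\alpha, Y_\gamma') = \mathscr{E}(Y_\alpha - \nu_\alpha)(Y_\gamma - \nu_\gamma)'$

      $= \mathscr{E}\left[\sum_{\beta=1}^{N} c_{\alpha\beta}(X_\beta - \mu_\beta)\right]\left[\sum_{\varepsilon=1}^{N} c_{\gamma\varepsilon}(X_\varepsilon - \mu_\varepsilon)'\right]$

      $= \sum_{\beta,\varepsilon=1}^{N} c_{\alpha\beta}c_{\gamma\varepsilon}\,\mathscr{E}(X_\beta - \mu_\beta)(X_\varepsilon - \mu_\varepsilon)'$

      $= \sum_{\beta,\varepsilon=1}^{N} c_{\alpha\beta}c_{\gamma\varepsilon}\,\delta_{\beta\varepsilon}\Sigma$

      $= \sum_{\beta=1}^{N} c_{\alpha\beta}c_{\gamma\beta}\,\Sigma$

      $= \delta_{\alpha\gamma}\Sigma,$

where $\delta_{\alpha\gamma}$ is the Kronecker delta (= 1 if $\alpha = \gamma$ and = 0 if $\alpha \ne \gamma$). This shows that $Y_\alpha$ is independent of $Y_\gamma$, $\alpha \ne \gamma$, and $Y_\alpha$ has the covariance matrix $\Sigma$.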
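We also use the following general lemma:


Lemma 3.3.1. If $C = (c_{\alpha\beta})$ is orthogonal, then $\sum_{\alpha=1}^{N} x_\alpha x_\alpha' = \sum_{\alpha=1}^{N} y_\alpha y_\alpha'$, where $y_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta}x_\beta$, $\alpha = 1, \dots, N$.


Proof.

(3)   $\sum_{\alpha=1}^{N} y_\alpha y_\alpha' = \sum_\alpha \sum_\beta c_{\alpha\beta}x_\beta \sum_\gamma c_{\alpha\gamma}x_\gamma'$

      $= \sum_{\beta,\gamma} \left(\sum_\alpha c_{\alpha\beta}c_{\alpha\gamma}\right) x_\beta x_\gamma'$

      $= \sum_{\beta,\gamma} \delta_{\beta\gamma}\,x_\beta x_\gamma'$

      $= \sum_\beta x_\beta x_\beta'.$


Let $X_1, \dots, X_N$ be independent, each distributed according to $N(\mu, \Sigma)$. There exists an $N \times N$ orthogonal matrix $B = (b_{\alpha\beta})$ with the last row

(4)   $(1/\sqrt{N}, \dots, 1/\sqrt{N}).$

(See Lemma A.4.2.) This transformation is a rotation in the N-dimensional space described in Section 3.2 with the equiangular line going into the Nth coordinate axis. Let $A = N\hat\Sigma$, defined in Section 3.2, and let

(5)   $Z_\alpha = \sum_{\beta=1}^{N} b_{\alpha\beta}X_\beta.$

Then

(6)   $Z_N = \sum_{\beta=1}^{N} b_{N\beta}X_\beta = \sum_{\beta=1}^{N} \dfrac{1}{\sqrt{N}}X_\beta = \sqrt{N}\,\bar{X}.$

By Lemma 3.3.1 we have

(7)   $A = \sum_{\alpha=1}^{N} X_\alpha X_\alpha' - N\bar{X}\bar{X}' = \sum_{\alpha=1}^{N} Z_\alpha Z_\alpha' - Z_N Z_N' = \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'.$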


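Since $Z_N$ is independent of $Z_1, \dots, Z_{N-1}$, the mean vector $\bar{X}$ is independent of A. Since

(8)   $\mathscr{E}Z_N = \sum_{\beta=1}^{N} b_{N\beta}\,\mathscr{E}X_\beta = \sum_{\beta=1}^{N} \dfrac{1}{\sqrt{N}}\mu = \sqrt{N}\,\mu,$

$Z_N$ is distributed according to $N(\sqrt{N}\mu, \Sigma)$ and $\bar{X} = (1/\sqrt{N})Z_N$ is distributed according to $N[\mu, (1/N)\Sigma]$. We note

(9)   $\mathscr{E}Z_\alpha = \sum_{\beta=1}^{N} b_{\alpha\beta}\,\mathscr{E}X_\beta = \sum_{\beta=1}^{N} b_{\alpha\beta}\,\mu = \sum_{\beta=1}^{N} b_{\alpha\beta}b_{N\beta}\sqrt{N}\,\mu = 0, \qquad \alpha \ne N.$


Theorem 3.3.2. The mean of a sample of size N from $N(\mu, \Sigma)$ is distributed according to $N[\mu, (1/N)\Sigma]$ and independently of $\hat\Sigma$, the maximum likelihood estimator of $\Sigma$. $N\hat\Sigma$ is distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_\alpha$ is distributed according to $N(0, \Sigma)$, $\alpha = 1, \dots, N - 1$, and $Z_1, \dots, Z_{N-1}$ are independent.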
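
The algebra behind (6) and (7) can be checked on any data set. The sketch below assumes one particular orthogonal matrix with last row (1/√N, ..., 1/√N), a Helmert-type matrix; the text only asserts that such a matrix exists, so this construction, the function name `helmert_like`, and the five data vectors are illustrative assumptions.

```python
import math

def helmert_like(n):
    """An n x n orthogonal matrix whose last row is (1/sqrt(n), ..., 1/sqrt(n)).

    Row k (k = 1, ..., n-1) has k entries 1/sqrt(k(k+1)), then -k/sqrt(k(k+1)),
    then zeros. This is one convenient choice, not the only one.
    """
    rows = []
    for k in range(1, n):
        norm = math.sqrt(k * (k + 1))
        rows.append([1.0 / norm] * k + [-k / norm] + [0.0] * (n - k - 1))
    rows.append([1.0 / math.sqrt(n)] * n)
    return rows

def outer(u, v):
    return [[a * b for b in v] for a in u]

def mat_add(m1, m2):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

xs = [[1.0, 2.0], [3.0, 0.5], [-1.0, 4.0], [2.0, 2.5], [0.5, 1.0]]  # hypothetical sample
N, p = len(xs), len(xs[0])
B = helmert_like(N)

# z_alpha = sum over beta of b_{alpha beta} x_beta, as in (5).
zs = [[sum(B[a][b] * xs[b][i] for b in range(N)) for i in range(p)] for a in range(N)]

xbar = [sum(x[i] for x in xs) / N for i in range(p)]

# A computed directly from deviations about the mean.
A = [[0.0] * p for _ in range(p)]
for x in xs:
    d = [x[i] - xbar[i] for i in range(p)]
    A = mat_add(A, outer(d, d))

# A computed as the sum of z_alpha z_alpha' over alpha = 1, ..., N-1, as in (7).
A_from_z = [[0.0] * p for _ in range(p)]
for z in zs[:N - 1]:
    A_from_z = mat_add(A_from_z, outer(z, z))

gap_A = max(abs(A[i][j] - A_from_z[i][j]) for i in range(p) for j in range(p))
gap_mean = max(abs(zs[N - 1][i] - math.sqrt(N) * xbar[i]) for i in range(p))
gap_orth = max(
    abs(sum(B[a][c] * B[b][c] for c in range(N)) - (1.0 if a == b else 0.0))
    for a in range(N) for b in range(N)
)
print(gap_A, gap_mean, gap_orth)
```

All three gaps are zero up to rounding: B is orthogonal, the last transformed vector is √N times the sample mean, and the first N − 1 transformed vectors reproduce A.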


Definition 3.3.1. An estimator t of a parameter vector $\theta$ is unbiased if and only if $\mathscr{E}_\theta t = \theta$.


Since $\mathscr{E}\bar{X} = (1/N)\,\mathscr{E}\sum_{\alpha=1}^{N} X_\alpha = \mu$, the sample mean is an unbiased estimator of the population mean. However,

(10)   $\mathscr{E}\hat\Sigma = \dfrac{1}{N}\,\mathscr{E}\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha' = \dfrac{N-1}{N}\,\Sigma.$

Thus $\hat\Sigma$ is a biased estimator of $\Sigma$. We shall therefore define

(11)   $S = \dfrac{1}{N-1}A = \dfrac{1}{N-1}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$

as the sample covariance matrix. It is an unbiased estimator of $\Sigma$ and the diagonal elements are the usual (unbiased) sample variances of the components of X.


3.3.2. Tests and Confidence Regions for the Mean Vector When the
Covariance Matrix Is Known


A statistical problem of considerable importance is that of testing the hypothesis that the mean vector of a normal distribution is a given vector, and a related problem is that of giving a confidence region for the unknown vector of means. We now go on to study these problems under the assumption that the covariance matrix $\Sigma$ is known. In Chapter 5 we consider these problems when the covariance matrix is unknown.

In the univariate case one bases a test or a confidence interval on the fact 
that the difference between the sample mean and the population mean is 
normally distributed with mean zero and known variance; then tables of the 
normal distribution can be used to set up significance points or to compute 
confidence intervals. In the multivariate case one uses the fact that the 
difference between the sample mean vector and the population mean vector 
is normally distributed with mean vector zero and known covariance matrix. 
One could set up limits for each component on the basis of the distribution, 
but this procedure has the disadvantages that the choice of limits is some- 
what arbitrary and in the case of tests leads to tests that may be very poor 
against some alternatives, and, moreover, such limits are difficult to compute 
because tables are available only for the bivariate case. The procedures given 
below, however, are easily computed and furthermore can be given general 
intuitive and theoretical justifications. 

The procedures and evaluation of their properties are based on the 
following theorem: 


Theorem 3.3.3. If the m-component vector Y is distributed according to $N(\nu, T)$ (nonsingular), then $Y'T^{-1}Y$ is distributed according to the noncentral $\chi^2$-distribution with m degrees of freedom and noncentrality parameter $\nu'T^{-1}\nu$. If $\nu = 0$, the distribution is the central $\chi^2$-distribution.


Proof. Let C be a nonsingular matrix such that $CTC' = I$, and define $Z = CY$. Then Z is normally distributed with mean $\mathscr{E}Z = C\mathscr{E}Y = C\nu = \lambda$, say, and covariance matrix $\mathscr{E}(Z - \lambda)(Z - \lambda)' = \mathscr{E}C(Y - \nu)(Y - \nu)'C' = CTC' = I$. Then $Y'T^{-1}Y = Z'(C')^{-1}T^{-1}C^{-1}Z = Z'(CTC')^{-1}Z = Z'Z$, which is the sum of squares of the components of Z. Similarly $\nu'T^{-1}\nu = \lambda'\lambda$. Thus $Y'T^{-1}Y$ is distributed as $\sum_{i=1}^{m} Z_i^2$, where $Z_1, \dots, Z_m$ are independently normally distributed with means $\lambda_1, \dots, \lambda_m$, respectively, and variances 1. By definition this distribution is the noncentral $\chi^2$-distribution with noncentrality parameter $\sum_{i=1}^{m} \lambda_i^2$. See Section 3.3.3. If $\lambda_1 = \cdots = \lambda_m = 0$, the distribution is central. (See Problem 7.5.)


Since $\sqrt{N}(\bar{X} - \mu)$ is distributed according to $N(0, \Sigma)$, it follows from the theorem that

(12)   $N(\bar{X} - \mu)'\Sigma^{-1}(\bar{X} - \mu)$

has a (central) $\chi^2$-distribution with p degrees of freedom. This is the fundamental fact we use in setting up tests and confidence regions concerning $\mu$.

Let $\chi_p^2(\alpha)$ be the number such that

(13)   $\Pr\{\chi_p^2 > \chi_p^2(\alpha)\} = \alpha.$

Thus

(14)   $\Pr\{N(\bar{X} - \mu)'\Sigma^{-1}(\bar{X} - \mu) > \chi_p^2(\alpha)\} = \alpha.$

To test the hypothesis that $\mu = \mu_0$, where $\mu_0$ is a specified vector, we use as our critical region

(15)   $N(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) > \chi_p^2(\alpha).$

If we obtain a sample such that (15) is satisfied, we reject the null hypothesis. It can be seen intuitively that the probability is greater than $\alpha$ of rejecting the hypothesis if $\mu$ is very different from $\mu_0$, since in the space of $\bar{x}$ (15) defines an ellipsoid with center at $\mu_0$, and when $\mu$ is far from $\mu_0$ the density of $\bar{x}$ will be concentrated at a point near the edge or outside of the ellipsoid. The quantity $N(\bar{X} - \mu_0)'\Sigma^{-1}(\bar{X} - \mu_0)$ is distributed as a noncentral $\chi^2$ with p degrees of freedom and noncentrality parameter $N(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0)$ when $\bar{X}$ is the mean of a sample of N from $N(\mu, \Sigma)$ [given by Bose (1936a), (1936b)]. Pearson (1900) first proved Theorem 3.3.3 for $\nu = 0$.

Now consider the following statement made on the basis of a sample with mean $\bar{x}$: "The mean of the distribution satisfies

(16)   $N(\bar{x} - \mu^*)'\Sigma^{-1}(\bar{x} - \mu^*) \le \chi_p^2(\alpha)$

as an inequality on $\mu^*$." We see from (14) that the probability that a sample will be drawn such that the above statement is true is $1 - \alpha$ because the event in (14) is equivalent to the statement being false. Thus, the set of $\mu^*$ satisfying (16) is a confidence region for $\mu$ with confidence $1 - \alpha$.

In the p-dimensional space of $\bar{x}$, (15) is the surface and exterior of an ellipsoid with center $\mu_0$, the shape of the ellipsoid depending on $\Sigma^{-1}$ and the size on $(1/N)\chi_p^2(\alpha)$ for given $\Sigma^{-1}$. In the p-dimensional space of $\mu^*$ (16) is the surface and interior of an ellipsoid with its center at $\bar{x}$. If $\Sigma^{-1} = I$, then (14) says that the probability is $\alpha$ that the distance between $\bar{x}$ and $\mu$ is greater than $\sqrt{\chi_p^2(\alpha)/N}$.
Theorem 3.3.4. If $\bar{x}$ is the mean of a sample of N drawn from $N(\mu, \Sigma)$ and $\Sigma$ is known, then (15) gives a critical region of size $\alpha$ for testing the hypothesis $\mu = \mu_0$, and (16) gives a confidence region for $\mu$ of confidence $1 - \alpha$. Here $\chi_p^2(\alpha)$ is chosen to satisfy (13).
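
The test of Theorem 3.3.4 is simple to carry out when p = 2, because the $\chi^2$-distribution with 2 degrees of freedom has the closed-form tail $\Pr\{\chi_2^2 > c\} = e^{-c/2}$, so that $\chi_2^2(\alpha) = -2\log\alpha$ without tables. The sketch below uses this fact; the sample size, sample mean, hypothesized mean, and "known" covariance matrix are hypothetical numbers chosen only for illustration, and `quad_form_inv_2x2` is a helper written for this sketch.

```python
import math

def quad_form_inv_2x2(d, sigma):
    """Return d' sigma^{-1} d for a 2-vector d and a 2 x 2 positive definite sigma."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    inv = [[sigma[1][1] / det, -sigma[0][1] / det],
           [-sigma[1][0] / det, sigma[0][0] / det]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

# Hypothetical inputs: sample size, sample mean, hypothesized mean, known covariance.
N = 10
xbar = [2.33, 0.75]
mu0 = [0.0, 0.0]
sigma = [[4.0, 3.0], [3.0, 3.0]]
alpha = 0.05

d = [xbar[i] - mu0[i] for i in range(2)]
stat = N * quad_form_inv_2x2(d, sigma)   # left side of (15)
crit = -2.0 * math.log(alpha)            # chi-square(2) upper alpha point
reject = stat > crit                     # critical region (15)

print(stat, crit, reject)
```

The same quadratic form, evaluated at a candidate $\mu^*$ in place of $\mu_0$ and compared with the same critical value, decides whether $\mu^*$ lies in the confidence region (16). For p > 2 the critical value would have to come from tables or a numerical routine.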


The same technique can be used for the corresponding two-sample prob- 
lems. Suppose we have a sample {x}, а= 1,..., №, from the distribution 
Ми, X), and a sample {x}, а = 1,..., №,, from a second normal popula- 
tion NC р”, X) with the same covariance matrix. Then the two sample 
means 


N 2 
(17) yO = x Y x), х0 = d x2) 
l а=1 


2 а=] 
are distributed independently according to М pM, (1/N,)2%] and 
Ми, (1/N;)X], respectively. The difference of the two sample means, 


y =£ — #0), is distributed according to Ntv,[(1 /N) + а/м, where 
v = p — aO. Thus 


М, 1 
(18) МАО») Аа) 


is а confidence region for the difference v of the two mean vectors, and a 
critical region for testing the hypothesis в = р is given by 


NN, oup соу 2 
(19) . N, FN; GO -3?)'X (x0 739) > (а). 


Mahalanobis (1930) suggested $(\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$ as a measure of
the distance squared between two populations. Let C be a matrix such that
$\Sigma = CC'$ and let $\nu^{(i)} = C^{-1}\mu^{(i)}$, i = 1, 2. Then the distance squared is
$(\nu^{(1)} - \nu^{(2)})'(\nu^{(1)} - \nu^{(2)})$, which is the Euclidean distance squared.
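The equivalence between the Mahalanobis distance and the Euclidean distance after the transformation by $C^{-1}$ can be checked numerically. The sketch below handles only the 2 x 2 case and takes C to be the lower-triangular Cholesky factor, which is one possible choice of a matrix with $\Sigma = CC'$; the helper names are illustrative assumptions.

```python
import math

def cholesky_2x2(s):
    """Lower-triangular C with C C' = s, for a positive definite 2x2 matrix s."""
    c11 = math.sqrt(s[0][0])
    c21 = s[1][0] / c11
    c22 = math.sqrt(s[1][1] - c21 * c21)
    return ((c11, 0.0), (c21, c22))

def solve_lower_2x2(c, b):
    """Solve C v = b for lower-triangular C, that is, v = C^{-1} b."""
    v1 = b[0] / c[0][0]
    v2 = (b[1] - c[1][0] * v1) / c[1][1]
    return (v1, v2)

def mahalanobis_sq_direct(m1, m2, s):
    """(m1 - m2)' s^{-1} (m1 - m2), using the explicit inverse of a symmetric 2x2 s."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d0, d1 = m1[0] - m2[0], m1[1] - m2[1]
    return (s[1][1] * d0 * d0 - 2.0 * s[0][1] * d0 * d1 + s[0][0] * d1 * d1) / det

def mahalanobis_sq_via_c(m1, m2, s):
    """Euclidean distance squared between C^{-1} m1 and C^{-1} m2."""
    c = cholesky_2x2(s)
    v1 = solve_lower_2x2(c, m1)
    v2 = solve_lower_2x2(c, m2)
    return (v1[0] - v2[0]) ** 2 + (v1[1] - v2[1]) ** 2
```

For $\Sigma$ with rows (4, 2) and (2, 3), $\mu^{(1)} = (1, 2)'$, and $\mu^{(2)} = 0$, both routes give 11/8.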


3.3.3. The Noncentral $\chi^2$-Distribution; the Power Function


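The power function of the test (15) of the null hypothesis that $\mu = \mu_0$ can be
evaluated from the noncentral $\chi^2$-distribution. The central $\chi^2$-distribution is
the distribution of the sum of squares of independent (scalar) normal
variables with means 0 and variances 1; the noncentral $\chi^2$-distribution is the
generalization of this when the means may be different from 0. Let Y (of p
components) be distributed according to $N(\lambda, I)$. Let Q be an orthogonal
matrix with elements of the first row being

(20)   $q_{1i} = \frac{\lambda_i}{\sqrt{\lambda'\lambda}}, \qquad i = 1, \ldots, p.$

Then Z = QY is distributed according to $N(\boldsymbol{\tau}, I)$, where

(21)   $\boldsymbol{\tau} = (\tau, 0, \ldots, 0)'$

and $\tau = \sqrt{\lambda'\lambda}$. Let $V = Y'Y = Z'Z = \sum_{i=1}^{p} Z_i^2$. Then $W = \sum_{i=2}^{p} Z_i^2$ has a
$\chi^2$-distribution with p - 1 degrees of freedom (Problem 7.5) and $Z_1$ and W
have as joint density

(22)   $\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(z_1 - \tau)^2} \cdot \frac{1}{2^{\frac{1}{2}(p-1)} \Gamma[\frac{1}{2}(p-1)]} w^{\frac{1}{2}(p-1)-1} e^{-\frac{1}{2}w}$

       $= C e^{-\frac{1}{2}(\tau^2 + z_1^2 + w)} w^{\frac{1}{2}(p-3)} e^{\tau z_1}$

       $= C e^{-\frac{1}{2}(\tau^2 + z_1^2 + w)} w^{\frac{1}{2}(p-3)} \sum_{\alpha=0}^{\infty} \frac{\tau^\alpha z_1^\alpha}{\alpha!},$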


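where $C^{-1} = 2^{\frac{1}{2}p} \sqrt{\pi}\, \Gamma[\frac{1}{2}(p-1)]$. The joint density of $V = W + Z_1^2$ and $Z_1$ is
obtained by substituting $w = v - z_1^2$ (the Jacobian being 1):

(23)   $C e^{-\frac{1}{2}(\tau^2 + v)} (v - z_1^2)^{\frac{1}{2}(p-3)} \sum_{\alpha=0}^{\infty} \frac{\tau^\alpha z_1^\alpha}{\alpha!}.$

The joint density of V and $U = Z_1/\sqrt{V}$ is ($dz_1 = \sqrt{v}\, du$)

(24)   $C e^{-\frac{1}{2}(\tau^2 + v)} v^{\frac{1}{2}(p-2)} (1 - u^2)^{\frac{1}{2}(p-3)} \sum_{\alpha=0}^{\infty} \frac{\tau^\alpha v^{\frac{1}{2}\alpha} u^\alpha}{\alpha!}.$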


The admissible range of $z_1$ given v is $-\sqrt{v}$ to $\sqrt{v}$, and the admissible range
of u is -1 to 1. When we integrate (24) with respect to u term by term, the
terms for $\alpha$ odd integrate to 0, since such a term is an odd function of u. In
the other integrations we substitute $u = \sqrt{s}$ ($du = \frac{1}{2} ds/\sqrt{s}$) to obtain

(25)   $\int_{-1}^{1} (1 - u^2)^{\frac{1}{2}(p-3)} u^{2\beta}\, du = 2 \int_{0}^{1} (1 - u^2)^{\frac{1}{2}(p-3)} u^{2\beta}\, du$

       $= \int_{0}^{1} (1 - s)^{\frac{1}{2}(p-3)} s^{\beta - \frac{1}{2}}\, ds$

       $= B[\frac{1}{2}(p-1), \beta + \frac{1}{2}]$

       $= \frac{\Gamma[\frac{1}{2}(p-1)]\, \Gamma(\beta + \frac{1}{2})}{\Gamma(\frac{1}{2}p + \beta)}$

by the usual properties of the beta and gamma functions. Thus the density of
V is
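(26)   $\frac{1}{2^{\frac{1}{2}p} \sqrt{\pi}} e^{-\frac{1}{2}(\tau^2 + v)} v^{\frac{1}{2}p - 1} \sum_{\beta=0}^{\infty} \frac{(\tau^2)^\beta v^\beta\, \Gamma(\beta + \frac{1}{2})}{(2\beta)!\, \Gamma(\frac{1}{2}p + \beta)}.$

We can use the duplication formula for the gamma function $\Gamma(2\beta + 1) = (2\beta)!$
(Problem 7.37),

(27)   $\Gamma(2\beta + 1) = \Gamma(\beta + \frac{1}{2})\, \Gamma(\beta + 1)\, 2^{2\beta}/\sqrt{\pi},$

to rewrite (26) as

(28)   $\frac{1}{2^{\frac{1}{2}p}} e^{-\frac{1}{2}(\tau^2 + v)} v^{\frac{1}{2}p - 1} \sum_{\beta=0}^{\infty} \frac{(\tau^2/4)^\beta v^\beta}{\beta!\, \Gamma(\frac{1}{2}p + \beta)}.$

This is the density of the noncentral $\chi^2$-distribution with p degrees of
freedom and noncentrality parameter $\tau^2$.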
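The series (28) can be evaluated directly. The following is a minimal sketch that sums the series on the logarithmic scale with `math.lgamma`; the function name and the truncation point `terms` are assumptions chosen for illustration, and the truncation is adequate only for moderate values of $\tau^2 v$.

```python
import math

def noncentral_chi2_pdf(v, p, tau2, terms=200):
    """Density (28) at v, with p degrees of freedom and noncentrality tau2."""
    if v <= 0.0:
        return 0.0
    # log of 2^{-p/2} * exp(-(tau2 + v)/2) * v^{p/2 - 1}
    log_front = (-0.5 * p * math.log(2.0) - 0.5 * (tau2 + v)
                 + (0.5 * p - 1.0) * math.log(v))
    # beta = 0 term of the series is 1 / Gamma(p/2)
    total = math.exp(log_front - math.lgamma(0.5 * p))
    if tau2 > 0.0:
        log_x = math.log(tau2 * v / 4.0)
        for beta in range(1, terms):
            # log of (tau2 * v / 4)^beta / (beta! * Gamma(p/2 + beta))
            log_term = (beta * log_x - math.lgamma(beta + 1.0)
                        - math.lgamma(0.5 * p + beta))
            total += math.exp(log_front + log_term)
    return total
```

For $\tau^2 = 0$ only the first term survives and the function reduces to the central $\chi^2$ density; for p = 2 that is $\frac{1}{2}e^{-\frac{1}{2}v}$.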


Theorem 3.3.5. If Y of p components is distributed according to $N(\lambda, I)$,
then V = Y'Y has the density (28), where $\tau^2 = \lambda'\lambda$.


To obtain the power function of the test (15), we note that $\sqrt{N}(\bar{X} - \mu_0)$
has the distribution $N[\sqrt{N}(\mu - \mu_0), \Sigma]$. From Theorem 3.3.3 we obtain the
following corollary:


Corollary 3.3.1. If $\bar{X}$ is the mean of a random sample of N drawn from
$N(\mu, \Sigma)$, then $N(\bar{X} - \mu_0)' \Sigma^{-1} (\bar{X} - \mu_0)$ has a noncentral $\chi^2$-distribution with p
degrees of freedom and noncentrality parameter $N(\mu - \mu_0)' \Sigma^{-1} (\mu - \mu_0)$.
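As an illustration of the corollary, the power of the test (15) can be computed for p = 2. The sketch below relies on two facts that are not stated in the text but follow from (28): integrating the series term by term writes the noncentral distribution as a Poisson-weighted mixture of central $\chi^2$-distributions with p + 2$\beta$ degrees of freedom, and for even degrees of freedom 2k the central upper-tail probability is $e^{-c/2}\sum_{j<k}(c/2)^j/j!$. The function names are illustrative.

```python
import math

def chi2_sf_even_df(c, k):
    """Pr{central chi-square with 2k d.f. > c}, valid for integer k >= 1."""
    half = 0.5 * c
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total

def power_p2(tau2, alpha, terms=200):
    """Power of test (15) for p = 2 when the noncentrality parameter is tau2."""
    c = -2.0 * math.log(alpha)      # upper alpha point of chi-square, 2 d.f.
    lam = 0.5 * tau2
    weight = math.exp(-lam)         # Poisson weight for beta = 0
    power = 0.0
    for beta in range(terms):
        # component with 2 + 2*beta degrees of freedom, i.e. k = 1 + beta
        power += weight * chi2_sf_even_df(c, 1 + beta)
        weight *= lam / (beta + 1)
    return power
```

Here `tau2` stands for $N(\mu - \mu_0)' \Sigma^{-1} (\mu - \mu_0)$. At `tau2 = 0` the function returns the size $\alpha$, and the power increases with the noncentrality parameter.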
3.4. THEORETICAL PROPERTIES OF ESTIMATORS
OF THE MEAN VECTOR 


3.4.1. Properties of Maximum Likelihood Estimators 


It was shown in Section 3.3.1 that $\bar{x}$ and S are unbiased estimators of $\mu$ and
$\Sigma$, respectively. In this subsection we shall show that $\bar{x}$ and S are sufficient
statistics and are complete.


Sufficiency 

A statistic T is sufficient for a family of distributions of X or for a parameter
$\theta$ if the conditional distribution of X given T = t does not depend on $\theta$ [e.g.,
Cramér (1946), Section 32.4]. In this sense the statistic T gives as much
information about $\theta$ as the entire sample X. (Of course, this idea depends
strictly on the assumed family of distributions.)


Factorization Theorem. A statistic t(y) is sufficient for $\theta$ if and only if the
density $f(y | \theta)$ can be factored as

(1)   $f(y | \theta) = g[t(y), \theta]\, h(y),$

where $g[t(y), \theta]$ and h(y) are nonnegative and h(y) does not depend on $\theta$.


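Theorem 3.4.1. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$, then $\bar{x}$ and S
are sufficient for $\mu$ and $\Sigma$. If $\mu$ is given, $\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)'$ is sufficient for
$\Sigma$. If $\Sigma$ is given, $\bar{x}$ is sufficient for $\mu$.


Proof. The density of $X_1, \ldots, X_N$ is

(2)   $\prod_{\alpha=1}^{N} n(x_\alpha | \mu, \Sigma)$

      $= (2\pi)^{-\frac{1}{2}Np} |\Sigma|^{-\frac{1}{2}N} \exp\left[ -\frac{1}{2} \mathrm{tr}\, \Sigma^{-1} \sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)' \right]$

      $= (2\pi)^{-\frac{1}{2}Np} |\Sigma|^{-\frac{1}{2}N} \exp\left\{ -\frac{1}{2} \left[ N(\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) + (N - 1)\, \mathrm{tr}\, \Sigma^{-1} S \right] \right\}.$

The right-hand side of (2) is in the form of (1) for $\bar{x}$, S, $\mu$, $\Sigma$, and the middle
is in the form of (1) for $\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)'$, $\Sigma$; in each case $h(x_1, \ldots, x_N) = 1$.
The right-hand side is in the form of (1) for $\bar{x}$, $\mu$ with $h(x_1, \ldots, x_N) =
\exp\{ -\frac{1}{2}(N - 1)\, \mathrm{tr}\, \Sigma^{-1} S \}$. ∎


Note that if $\Sigma$ is given, $\bar{x}$ is sufficient for $\mu$, but if $\mu$ is given, S is not
sufficient for $\Sigma$.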
Completeness

To prove an optimality property of the $T^2$-test (Section 5.5), we need the
result that $(\bar{x}, S)$ is a complete sufficient set of statistics for $(\mu, \Sigma)$.


Definition 3.4.1. A family of distributions of y indexed by $\theta$ is complete if
for every real-valued function g(y),

(3)   $\mathcal{E}_\theta\, g(y) \equiv 0$

identically in $\theta$ implies g(y) = 0 except for a set of y of probability 0 for every $\theta$.


If the family of distributions of a sufficient set of statistics is complete, the
set is called a complete sufficient set.


Theorem 3.4.2. The sufficient set of statistics $\bar{x}$, S is complete for $\mu$, $\Sigma$ when
the sample is drawn from $N(\mu, \Sigma)$.


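Proof. We can define the sample in terms of $\bar{x}$ and $z_1, \ldots, z_n$ as before,
with n = N - 1. We assume for any function $g(\bar{x}, A) = g(\bar{x}, nS)$ that

(4)   $\int \cdots \int K |\Sigma|^{-\frac{1}{2}N} g\left( \bar{x}, \sum_{\alpha=1}^{n} z_\alpha z_\alpha' \right)$
      $\cdot \exp\left\{ -\frac{1}{2} \left[ N(\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) + \sum_{\alpha=1}^{n} z_\alpha' \Sigma^{-1} z_\alpha \right] \right\} d\bar{x} \prod_{\alpha=1}^{n} dz_\alpha \equiv 0, \qquad \forall \mu, \Sigma,$

where $K = \sqrt{N} (2\pi)^{-\frac{1}{2}pN}$, $d\bar{x} = \prod_{i=1}^{p} d\bar{x}_i$, and $dz_\alpha = \prod_{i=1}^{p} dz_{i\alpha}$. If we let
$\Sigma^{-1} = I - 2\Theta$, where $\Theta = \Theta'$ and $I - 2\Theta$ is positive definite, and let
$\mu = (I - 2\Theta)^{-1} t$, then (4) is

(5)   $0 \equiv \int \cdots \int K |I - 2\Theta|^{\frac{1}{2}N} g\left( \bar{x}, \sum_{\alpha=1}^{n} z_\alpha z_\alpha' \right)$
      $\cdot \exp\left\{ -\frac{1}{2} \left[ \mathrm{tr}\, (I - 2\Theta) \left( \sum_{\alpha=1}^{n} z_\alpha z_\alpha' + N \bar{x} \bar{x}' \right) - 2N t' \bar{x} + N t' (I - 2\Theta)^{-1} t \right] \right\} d\bar{x} \prod_{\alpha=1}^{n} dz_\alpha$

      $= |I - 2\Theta|^{\frac{1}{2}N} \exp\left\{ -\frac{1}{2} N t' (I - 2\Theta)^{-1} t \right\} \int \cdots \int g(\bar{x}, B - N \bar{x} \bar{x}')$
      $\cdot \exp[\mathrm{tr}\, \Theta B + t'(N \bar{x})]\, n[\bar{x} | 0, (1/N) I] \prod_{\alpha=1}^{n} n(z_\alpha | 0, I)\, d\bar{x} \prod_{\alpha=1}^{n} dz_\alpha,$

where $B = \sum_{\alpha=1}^{n} z_\alpha z_\alpha' + N \bar{x} \bar{x}'$. Thus

(6)   $0 \equiv \mathcal{E}\, g(\bar{x}, B - N \bar{x} \bar{x}') \exp[\mathrm{tr}\, \Theta B + t'(N \bar{x})]$

      $= \int \cdots \int g(\bar{x}, B - N \bar{x} \bar{x}') \exp[\mathrm{tr}\, \Theta B + t'(N \bar{x})]\, h(\bar{x}, B)\, d\bar{x}\, dB,$

where $h(\bar{x}, B)$ is the joint density of $\bar{x}$ and B and $dB = \prod_{i \le j} db_{ij}$. The
right-hand side of (6) is the Laplace transform of $g(\bar{x}, B - N \bar{x} \bar{x}')\, h(\bar{x}, B)$.
Since this is 0, $g(\bar{x}, A) = 0$ except for a set of measure 0. ∎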


Efficiency

If a q-component random vector Y has mean vector $\mathcal{E}Y = \nu$ and covariance
matrix $\mathcal{E}(Y - \nu)(Y - \nu)' = \Psi$, then

(7)   $(y - \nu)' \Psi^{-1} (y - \nu) = q + 2$

is called the concentration ellipsoid of Y. [See Cramér (1946), p. 300.] The
density defined by a uniform distribution over the interior of this ellipsoid
has the same mean vector and covariance matrix as Y. (See Problem 2.14.)
Let $\theta$ be a vector of q parameters in a distribution, and let t be a vector of
unbiased estimators (that is, $\mathcal{E}t = \theta$) based on N observations from that
distribution with covariance matrix $\Psi$. Then the ellipsoid

(8)   $N (t - \theta)'\, \mathcal{E}\left[ \frac{\partial \log f}{\partial \theta} \left( \frac{\partial \log f}{\partial \theta} \right)' \right] (t - \theta) = q + 2$

lies entirely within the ellipsoid of concentration of t; $\partial \log f / \partial \theta$ denotes the
column vector of derivatives of the logarithm of the density of the distribution
(or probability function) with respect to the components of $\theta$. The discussion by
Cramér (1946, p. 495) is in terms of scalar observations, but it is clear that it holds
true for vector observations. If (8) is the ellipsoid of concentration of t, then
t is said to be efficient. In general, the ratio of the volume of (8) to that of
the ellipsoid of concentration defines the efficiency of t. In the case of the
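multivariate normal distribution, if $\theta = \mu$, then $\bar{x}$ is efficient. If $\theta$ includes
both $\mu$ and $\Sigma$, then $\bar{x}$ and S have efficiency $[(N - 1)/N]^{p(p+1)/2}$. Under
suitable regularity conditions, which are satisfied by the multivariate normal
distribution,

(9)   $\mathcal{E}\left[ \frac{\partial \log f}{\partial \theta} \left( \frac{\partial \log f}{\partial \theta} \right)' \right] = -\mathcal{E}\, \frac{\partial^2 \log f}{\partial \theta\, \partial \theta'}.$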


This is the information matrix for one observation. The Cramér-Rao lower
bound is that for any unbiased estimator t the matrix

(10)   $N \mathcal{E}(t - \theta)(t - \theta)' - \left[ -\mathcal{E}\, \frac{\partial^2 \log f}{\partial \theta\, \partial \theta'} \right]^{-1}$

is positive semidefinite. (Other lower bounds can also be given.)

Consistency

Definition 3.4.2. A sequence of vectors $t_n = (t_{1n}, \ldots, t_{mn})'$, n = 1, 2, ..., is
a consistent estimator of $\theta = (\theta_1, \ldots, \theta_m)'$ if $\mathrm{plim}_{n \to \infty}\, t_{in} = \theta_i$, i = 1, ..., m.


By the law of large numbers each component of the sample mean $\bar{x}$ is a
consistent estimator of that component of the vector of expected values $\mu$ if
the observation vectors are independently and identically distributed with
mean $\mu$, and hence $\bar{x}$ is a consistent estimator of $\mu$. Normality is not
involved.

An element of the sample covariance matrix is

(11)   $s_{ij} = \frac{1}{N - 1} \sum_{\alpha=1}^{N} (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j) - \frac{N}{N - 1} (\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)$

by Lemma 3.2.1 with $b = \mu$. The probability limit of the second term is 0.
The probability limit of the first term is $\sigma_{ij}$ if $x_1, x_2, \ldots$ are independently
and identically distributed with mean $\mu$ and covariance matrix $\Sigma$. Then S is
a consistent estimator of $\Sigma$.


Asymptotic Normality

First we prove a multivariate central limit theorem.


Theorem 3.4.3. Let the m-component vectors $Y_1, Y_2, \ldots$ be independently
and identically distributed with means $\mathcal{E}Y_\alpha = \nu$ and covariance matrices
$\mathcal{E}(Y_\alpha - \nu)(Y_\alpha - \nu)' = T$. Then the limiting distribution of $(1/\sqrt{n}) \sum_{\alpha=1}^{n} (Y_\alpha - \nu)$
as $n \to \infty$ is N(0, T).


Proof. Let

(12)   $\phi_n(t, u) = \mathcal{E} \exp\left[ i u t' \frac{1}{\sqrt{n}} \sum_{\alpha=1}^{n} (Y_\alpha - \nu) \right],$

where u is a scalar and t an m-component vector. For fixed t, $\phi_n(t, u)$ can be
considered as the characteristic function of $(1/\sqrt{n}) \sum_{\alpha=1}^{n} (t'Y_\alpha - \mathcal{E}t'Y_\alpha)$. By
the univariate central limit theorem [Cramér (1946), p. 215], the limiting
distribution is N(0, t'Tt). Therefore (Theorem 2.6.4),

(13)   $\lim_{n \to \infty} \phi_n(t, u) = e^{-\frac{1}{2} u^2 t'Tt}$

for every u and t. (For t = 0 a special and obvious argument is used.) Let
u = 1 to obtain

(14)   $\lim_{n \to \infty} \mathcal{E} \exp\left[ i t' \frac{1}{\sqrt{n}} \sum_{\alpha=1}^{n} (Y_\alpha - \nu) \right] = e^{-\frac{1}{2} t'Tt}$

for every t. Since $e^{-\frac{1}{2} t'Tt}$ is continuous at t = 0, the convergence is uniform in
some neighborhood of t = 0. The theorem follows. ∎


Now we wish to show that the sample covariance matrix is asymptotically
normally distributed as the sample size increases.


Theorem 3.4.4. Let $A(n) = \sum_{\alpha=1}^{N} (X_\alpha - \bar{X}_N)(X_\alpha - \bar{X}_N)'$, where $X_1, X_2, \ldots$
are independently distributed according to $N(\mu, \Sigma)$ and n = N - 1. Then the
limiting distribution of $B(n) = (1/\sqrt{n})[A(n) - n\Sigma]$ is normal with mean 0 and
covariances

(15)   $\mathcal{E}\, b_{ij}(n)\, b_{kl}(n) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}.$


Proof. As shown earlier, A(n) is distributed as $A(n) = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where
$Z_1, Z_2, \ldots$ are distributed independently according to $N(0, \Sigma)$. We arrange
the elements of $Z_\alpha Z_\alpha'$ in a vector such as

(16)   $Y_\alpha = (Z_{1\alpha}^2,\; Z_{1\alpha} Z_{2\alpha},\; \ldots,\; Z_{2\alpha}^2,\; \ldots,\; Z_{p\alpha}^2)';$

the moments of $Y_\alpha$ can be deduced from the moments of $Z_\alpha$ as given in
Section 2.6. We have $\mathcal{E} Z_{i\alpha} Z_{j\alpha} = \sigma_{ij}$,
$\mathcal{E} Z_{i\alpha} Z_{j\alpha} Z_{k\alpha} Z_{l\alpha} = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}$, and
$\mathcal{E}(Z_{i\alpha} Z_{j\alpha} - \sigma_{ij})(Z_{k\alpha} Z_{l\alpha} - \sigma_{kl}) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}$. Thus the vectors $Y_\alpha$
defined by (16) satisfy the conditions of Theorem 3.4.3 with the elements
of $\nu$ being the elements of $\Sigma$ arranged in vector form similar to (16)
and the elements of T being given above. If the elements of A(n) are
arranged in vector form similar to (16), say the vector W(n), then $W(n) - n\nu
= \sum_{\alpha=1}^{n} (Y_\alpha - \nu)$. By Theorem 3.4.3, $(1/\sqrt{n})[W(n) - n\nu]$ has a limiting normal
distribution with mean 0 and the covariance matrix of $Y_\alpha$. ∎
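The limiting covariances (15) are simple to tabulate for a given $\Sigma$. The helper below is a small sketch with 0-based indices; its name and the nested-tuple representation of $\Sigma$ are assumptions for this illustration.

```python
def limiting_cov(sigma, i, j, k, l):
    """Limiting covariance (15) of b_ij(n) and b_kl(n); indices are 0-based."""
    return sigma[i][k] * sigma[j][l] + sigma[i][l] * sigma[j][k]
```

For instance, the limiting variance of $b_{11}(n)$ is $2\sigma_{11}^2$, and that of $b_{12}(n)$ is $\sigma_{11}\sigma_{22} + \sigma_{12}^2$.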


The elements of B(n) will have a limiting normal distribution with mean 0
if $x_1, x_2, \ldots$ are independently and identically distributed with finite
fourth-order moments, but the covariance structure of B(n) will depend on the
fourth-order moments.


3.4.2. Decision Theory 


It may be enlightening to consider estimation in terms of decision theory. We
review some of the concepts. An observation x is made on a random variable
X (which may be a vector) whose distribution $P_\theta$ depends on a parameter $\theta$
which is an element of a set $\Theta$. The statistician is to make a decision d in a
set D. A decision procedure is a function $\delta(x)$ whose domain is the set of
values of X and whose range is D. The loss in making decision d when the
distribution is $P_\theta$ is a nonnegative function $L(\theta, d)$. The evaluation of a
procedure $\delta(x)$ is on the basis of the risk function

(17)   $R(\theta, \delta) = \mathcal{E}_\theta L[\theta, \delta(X)].$

For example, if d and $\theta$ are univariate, the loss may be squared error,
$L(\theta, d) = (\theta - d)^2$, and the risk is the mean squared error $\mathcal{E}_\theta[\delta(X) - \theta]^2$.
A decision procedure $\delta(x)$ is as good as a procedure $\delta^*(x)$ if

(18)   $R(\theta, \delta) \le R(\theta, \delta^*), \qquad \forall \theta;$

$\delta(x)$ is better than $\delta^*(x)$ if (18) holds with a strict inequality for at least one
value of $\theta$. A procedure $\delta^*(x)$ is inadmissible if there exists another procedure
$\delta(x)$ that is better than $\delta^*(x)$. A procedure is admissible if it is not
inadmissible (i.e., if there is no procedure better than it) in terms of the given
loss function. A class of procedures is complete if for any procedure not in
the class there is a better procedure in the class. The class is minimal
complete if it does not contain a proper complete subclass. If a minimal
complete class exists, it is identical to the class of admissible procedures.
When such a class is available, there is no (mathematical) need to use a
procedure outside the minimal complete class. Sometimes it is convenient to
refer to an essentially complete class, which is a class of procedures such that
for every procedure outside the class there is one in the class that is just as
good.
For a given procedure the risk function is a function of the parameter. If
the parameter can be assigned an a priori distribution, say, with density $\rho(\theta)$,
then the average loss from use of a decision procedure $\delta(x)$ is

(19)   $r(\rho, \delta) = \mathcal{E}_\rho R(\theta, \delta) = \mathcal{E}_\rho \mathcal{E}_\theta L[\theta, \delta(X)].$

Given the a priori density $\rho$, the decision procedure $\delta(x)$ that minimizes
$r(\rho, \delta)$ is the Bayes procedure, and the resulting minimum of $r(\rho, \delta)$ is the
Bayes risk. Under general conditions Bayes procedures are admissible and
admissible procedures are Bayes or limits of Bayes procedures. If the density
of X given $\theta$ is $f(x | \theta)$, the joint density of X and $\theta$ is $f(x | \theta)\rho(\theta)$ and the
average risk of a procedure $\delta(x)$ is

(20)   $r(\rho, \delta) = \int_\Theta \int_X L[\theta, \delta(x)]\, f(x | \theta)\, \rho(\theta)\, dx\, d\theta$

       $= \int_X \left\{ \int_\Theta L[\theta, \delta(x)]\, g(\theta | x)\, d\theta \right\} f(x)\, dx;$

here

(21)   $f(x) = \int_\Theta f(x | \theta)\, \rho(\theta)\, d\theta, \qquad g(\theta | x) = \frac{f(x | \theta)\, \rho(\theta)}{f(x)}$

are the marginal density of X and the a posteriori density of $\theta$ given x. The
procedure that minimizes $r(\rho, \delta)$ is one that for each x minimizes the
expression in braces on the right-hand side of (20), that is, the expectation of
$L[\theta, \delta(x)]$ with respect to the a posteriori distribution. If $\theta$ and d are vectors
and $L(\theta, d) = (\theta - d)' Q (\theta - d)$, where Q is positive definite, then

(22)   $\mathcal{E}_{\theta | x} L[\theta, d(x)] = \mathcal{E}_{\theta | x} [\theta - \mathcal{E}(\theta | x)]' Q [\theta - \mathcal{E}(\theta | x)]$
       $+ [\mathcal{E}(\theta | x) - d(x)]' Q [\mathcal{E}(\theta | x) - d(x)].$

The minimum occurs at $d(x) = \mathcal{E}(\theta | x)$, the mean of the a posteriori distribution.


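Theorem 3.4.5. If $x_1, \ldots, x_N$ are independently distributed, each $x_\alpha$ according
to $N(\mu, \Sigma)$, and if $\mu$ has an a priori distribution $N(\nu, \Phi)$, then the a
posteriori distribution of $\mu$ given $x_1, \ldots, x_N$ is normal with mean

(23)   $\Phi \left( \Phi + \frac{1}{N}\Sigma \right)^{-1} \bar{x} + \frac{1}{N}\Sigma \left( \Phi + \frac{1}{N}\Sigma \right)^{-1} \nu$

and covariance matrix

(24)   $\Phi - \Phi \left( \Phi + \frac{1}{N}\Sigma \right)^{-1} \Phi.$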
Proof. Since $\bar{x}$ is sufficient for $\mu$, we need only consider $\bar{x}$, which has the
distribution of $\mu + v$, where v has the distribution $N[0, (1/N)\Sigma]$ and is
independent of $\mu$. Then the joint distribution of $\mu$ and $\bar{x}$ is

(25)   $N\left[ \begin{pmatrix} \nu \\ \nu \end{pmatrix}, \begin{pmatrix} \Phi & \Phi \\ \Phi & \Phi + \frac{1}{N}\Sigma \end{pmatrix} \right].$

The mean of the conditional distribution of $\mu$ given $\bar{x}$ is (by Theorem 2.5.1)

(26)   $\nu + \Phi \left( \Phi + \frac{1}{N}\Sigma \right)^{-1} (\bar{x} - \nu),$

which reduces to (23). ∎


Corollary 3.4.1. If $x_1, \ldots, x_N$ are independently distributed, each $x_\alpha$ according
to $N(\mu, \Sigma)$, $\mu$ has an a priori distribution $N(\nu, \Phi)$, and the loss function
is $(d - \mu)' Q (d - \mu)$, then the Bayes estimator of $\mu$ is (23).


The Bayes estimator of $\mu$ is a kind of weighted average of $\bar{x}$ and $\nu$, the
prior mean of $\mu$. If $(1/N)\Sigma$ is small compared to $\Phi$ (e.g., if N is large), $\nu$ is
given little weight. Put another way, if $\Phi$ is large, that is, the prior is
relatively uninformative, a large weight is put on $\bar{x}$. In fact, as $\Phi$ tends to $\infty$
in the sense that $\Phi^{-1} \to 0$, the estimator approaches $\bar{x}$.
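The weighting in (23) and (24) is easiest to see in the scalar case p = 1, where the matrices reduce to numbers. The sketch below is restricted to that case; the function and argument names (`sigma2` for $\Sigma$, `phi` for $\Phi$) are assumptions for illustration.

```python
def posterior_scalar(xbar, n_obs, sigma2, nu, phi):
    """Posterior mean (23) and variance (24) of mu when p = 1."""
    s = sigma2 / n_obs              # (1/N) * Sigma
    w = phi / (phi + s)             # weight on the sample mean
    mean = w * xbar + (1.0 - w) * nu
    var = phi - phi * phi / (phi + s)
    return mean, var
```

With $\bar{x} = 2$, N = 4, $\Sigma = 4$, $\nu = 0$, and $\Phi = 1$, the sample mean and prior mean receive equal weight, so the posterior mean is 1 and the posterior variance is 1/2. As `phi` grows, the posterior mean approaches `xbar`.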

A decision procedure $\delta_0(x)$ is minimax if

(27)   $\sup_\theta R(\theta, \delta_0) = \inf_\delta \sup_\theta R(\theta, \delta).$


Theorem 3.4.6. If $x_1, \ldots, x_N$ are independently distributed each according
to $N(\mu, \Sigma)$ and the loss function is $(d - \mu)' Q (d - \mu)$, then $\bar{x}$ is a minimax
estimator.


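Proof. This follows from a theorem in statistical decision theory that if a
procedure $\delta_0$ is extended Bayes [i.e., if for arbitrary $\varepsilon$, $r(\rho, \delta_0) \le r(\rho, \delta_\rho) + \varepsilon$
for suitable $\rho$, where $\delta_\rho$ is the corresponding Bayes procedure] and if
$R(\theta, \delta_0)$ is constant, then $\delta_0$ is minimax. [See, e.g., Ferguson (1967),
Theorem 3 of Section 2.11.] We find

(28)   $R(\mu, \bar{x}) = \mathcal{E}(\bar{x} - \mu)' Q (\bar{x} - \mu)$
       $= \mathcal{E}\, \mathrm{tr}\, Q (\bar{x} - \mu)(\bar{x} - \mu)'$
       $= \frac{1}{N} \mathrm{tr}\, Q\Sigma.$

Let (23) be $d(\bar{x})$. Its average risk is

(29)   $\mathcal{E}_{\bar{x}}\, \mathcal{E}_{\mu | \bar{x}} \{ \mathrm{tr}\, Q [d(\bar{x}) - \mu][d(\bar{x}) - \mu]' \,|\, \bar{x} \}$
       $= \mathrm{tr}\, Q \left[ \Phi - \Phi \left( \Phi + \frac{1}{N}\Sigma \right)^{-1} \Phi \right]$
       $= \frac{1}{N} \mathrm{tr}\, Q \left( I + \frac{1}{N}\Sigma\Phi^{-1} \right)^{-1} \Sigma \to \frac{1}{N} \mathrm{tr}\, Q\Sigma$

as $\Phi^{-1} \to 0$. ∎


For more discussion of decision theory see Ferguson (1967), DeGroot
(1970), or Berger (1980b).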


3.5. IMPROVED ESTIMATION OF THE MEAN 


3.5.1. Introduction 


The sample mean $\bar{x}$ seems the natural estimator of the population mean $\mu$
based on a sample from $N(\mu, \Sigma)$. It is the maximum likelihood estimator, a
sufficient statistic when $\Sigma$ is known, and the minimum variance unbiased
estimator. Moreover, it is equivariant in the sense that if an arbitrary vector v
is added to each observation vector and to $\mu$, the error of estimation
$(\bar{x} + v) - (\mu + v) = \bar{x} - \mu$ is independent of v; in other words, the error
does not depend on the choice of origin. However, Stein (1956b) showed the
startling fact that this conventional estimator is not admissible with respect to
the loss function that is the sum of mean squared errors of the components
when $\Sigma = I$ and $p \ge 3$. James and Stein (1961) produced an estimator which
has a smaller sum of mean squared errors; this estimator will be studied in
Section 3.5.2. Subsequent studies have shown that the phenomenon is
widespread and the implications imperative.


3.5.2. The James-Stein Estimator

The loss function

(1)   $L(\mu, m) = (m - \mu)'(m - \mu) = \sum_{i=1}^{p} (m_i - \mu_i)^2 = \|m - \mu\|^2$

is the sum of mean squared errors of the components of the estimator. We
shall show [James and Stein (1961)] that the sample mean is inadmissible by
displaying an alternative estimator that has a smaller expected loss for every
mean vector $\mu$. We assume that the normal distribution sampled has covariance
matrix proportional to I with the constant of proportionality known. It
will be convenient to take this constant to be such that $Y = (1/N)\sum_{\alpha=1}^{N} X_\alpha = \bar{X}$
has the distribution $N(\mu, I)$. Then the expected loss or risk of the estimator
Y is simply $\mathcal{E}\|Y - \mu\|^2 = \mathrm{tr}\, I = p$. The estimator proposed by James and Stein
is (essentially)

(2)   $m(y) = \left( 1 - \frac{p - 2}{\|y - \nu\|^2} \right) (y - \nu) + \nu,$

where $\nu$ is an arbitrary fixed vector and $p \ge 3$. This estimator shrinks the
observed y toward the specified $\nu$. The amount of shrinkage is negligible if y
is very different from $\nu$ and is considerable if y is close to $\nu$. In this sense $\nu$
is a favored point.
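The estimator (2) takes only a few lines to write down. The function below is a minimal sketch operating on plain Python sequences; its name and the error handling are assumptions for illustration, and the case y = $\nu$ (zero distance) is deliberately left undefined, as in (2).

```python
def james_stein(y, nu):
    """Estimator (2): shrink y toward nu by the factor 1 - (p - 2)/||y - nu||^2."""
    p = len(y)
    if p < 3:
        raise ValueError("the estimator is defined here for p >= 3")
    dist2 = sum((yi - vi) ** 2 for yi, vi in zip(y, nu))
    factor = 1.0 - (p - 2) / dist2
    return [factor * (yi - vi) + vi for yi, vi in zip(y, nu)]
```

For p = 4 and $\nu$ = 0, the observation y = (3, 0, 0, 0)' has $\|y\|^2 = 9$ and is multiplied by 7/9, whereas y = (1, 0, 0, 0)' is multiplied by -1 and so lands on the opposite side of $\nu$.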


Theorem 3.5.1. With respect to the loss function (1), the risk of the estimator
(2) is less than the risk of the estimator Y for $p \ge 3$.


We shall show that the risk of Y minus the risk of (2) is positive by
applying the following lemma due to Stein (1974).


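Lemma 3.5.1. If f(x) is a function such that

(3)   $f(b) - f(a) = \int_a^b f'(x)\, dx$

for all a and b (a < b) and if

(4)   $\int_{-\infty}^{\infty} |f'(x)| \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx < \infty,$

then

(5)   $\int_{-\infty}^{\infty} f(x)(x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx = \int_{-\infty}^{\infty} f'(x) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx.$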
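Proof of Lemma. We write the left-hand side of (5) as

(6)   $\int_\theta^{\infty} [f(x) - f(\theta)](x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx + \int_{-\infty}^{\theta} [f(x) - f(\theta)](x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx$

      $= \int_\theta^{\infty} \int_\theta^{x} f'(y)(x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dy\, dx - \int_{-\infty}^{\theta} \int_x^{\theta} f'(y)(x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dy\, dx$

      $= \int_\theta^{\infty} \int_y^{\infty} f'(y)(x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx\, dy - \int_{-\infty}^{\theta} \int_{-\infty}^{y} f'(y)(x - \theta) \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx\, dy,$

which yields the right-hand side of (5). Fubini's theorem justifies the
interchange of order of integration. (See Problem 3.22.) ∎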


The lemma can also be derived by integration by parts in special cases. 
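The identity (5) can be checked numerically for particular smooth functions. The sketch below approximates both sides by a midpoint rule over a wide interval around $\theta$; the integration range, step count, and function names are assumptions for this illustration and are not part of the lemma.

```python
import math

def normal_expectation(h, theta, half_width=10.0, steps=20000):
    """Midpoint-rule approximation of E h(X) for X distributed N(theta, 1)."""
    a = theta - half_width
    dx = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * dx
        total += h(x) * math.exp(-0.5 * (x - theta) ** 2)
    return total * dx / math.sqrt(2.0 * math.pi)

def stein_sides(f, f_prime, theta):
    """Left- and right-hand sides of (5) for a given f and its derivative."""
    lhs = normal_expectation(lambda x: f(x) * (x - theta), theta)
    rhs = normal_expectation(f_prime, theta)
    return lhs, rhs
```

For $f(x) = x^3$ both sides equal $3(1 + \theta^2)$, which can be verified by expanding $(Z + \theta)^3 Z$ with Z standard normal.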


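Proof of Theorem 3.5.1. The difference in risks is

(7)   $\Delta R(\mu) = \mathcal{E}_\mu \{ \|Y - \mu\|^2 - \|m(Y) - \mu\|^2 \}$

      $= \mathcal{E}_\mu \left\{ \sum_{i=1}^{p} (Y_i - \mu_i)^2 - \sum_{i=1}^{p} \left[ Y_i - \mu_i - \frac{p - 2}{\|Y - \nu\|^2} (Y_i - \nu_i) \right]^2 \right\}$

      $= \mathcal{E}_\mu \left\{ 2 \frac{p - 2}{\|Y - \nu\|^2} \sum_{i=1}^{p} (Y_i - \mu_i)(Y_i - \nu_i) - \frac{(p - 2)^2}{\|Y - \nu\|^2} \right\}.$

Now we use Lemma 3.5.1 with

(8)   $f(y_i) = \frac{y_i - \nu_i}{\sum_{j=1}^{p} (y_j - \nu_j)^2}, \qquad f'(y_i) = \frac{1}{\sum_{j=1}^{p} (y_j - \nu_j)^2} - \frac{2 (y_i - \nu_i)^2}{\left[ \sum_{j=1}^{p} (y_j - \nu_j)^2 \right]^2}.$

[For $p \ge 3$ the condition (4) is satisfied.] Then (7) is

(9)   $\Delta R(\mu) = \mathcal{E}_\mu \left\{ 2(p - 2) \sum_{i=1}^{p} \left[ \frac{1}{\|Y - \nu\|^2} - \frac{2 (Y_i - \nu_i)^2}{\|Y - \nu\|^4} \right] - \frac{(p - 2)^2}{\|Y - \nu\|^2} \right\}$

      $= (p - 2)^2\, \mathcal{E}_\mu \frac{1}{\|Y - \nu\|^2} > 0.$ ∎


This theorem states that Y is inadmissible for estimating $\mu$ when $p \ge 3$,
since the estimator (2) has a smaller risk for every $\mu$ (regardless of the choice
of $\nu$).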

The risk is the sum of the mean squared errors $\mathcal{E}[m_i(Y) - \mu_i]^2$. Since
$Y_1, \ldots, Y_p$ are independent and only the distribution of $Y_i$ depends on $\mu_i$, it is
puzzling that the improved estimator uses all the $Y_j$'s to estimate $\mu_i$; it seems
that irrelevant information is being used. Stein explained the phenomenon by
arguing that the sample distance squared of Y from $\nu$, that is, $\|Y - \nu\|^2$,
overestimates the squared distance of $\mu$ from $\nu$ and hence that the estimator
Y could be improved by bringing it nearer $\nu$ (whatever $\nu$ is). Berger (1980a),
following Brown, illustrated by Figure 3.4. The four points $x_1, x_2, x_3, x_4$
represent a spherical distribution centered at $\mu$. Consider the effects of
shrinkage. The average distance of $m(x_1)$ and $m(x_3)$ from $\mu$ is a little greater
than that of $x_1$ and $x_3$, but $m(x_2)$ and $m(x_4)$ are a little closer to $\mu$ than $x_2$
and $x_4$ are if the shrinkage is a certain amount. If p = 3, there are two more
points (not on the line $\nu$, $\mu$) that are shrunk closer to $\mu$.


Figure 3.4. Effect of shrinkage.
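The risk of the estimator (2) is

(10)   $\mathcal{E}_\mu \|m(Y) - \mu\|^2 = p - (p - 2)^2\, \mathcal{E}_\mu \frac{1}{\|Y - \nu\|^2},$

where $\|Y - \nu\|^2$ has a noncentral $\chi^2$-distribution with p degrees of freedom
and noncentrality parameter $\|\mu - \nu\|^2$. The farther $\mu$ is from $\nu$, the less the
improvement due to the James-Stein estimator, but there is always some
improvement. The density of $\|Y - \nu\|^2 = V$, say, is (28) of Section 3.3.3, where
$\tau^2 = \|\mu - \nu\|^2$. Then

(11)   $\mathcal{E}_\mu \frac{1}{\|Y - \nu\|^2} = \mathcal{E}_{\tau^2} \frac{1}{V}$

       $= \frac{1}{2^{\frac{1}{2}p}} e^{-\frac{1}{2}\tau^2} \sum_{\beta=0}^{\infty} \frac{(\tau^2/4)^\beta}{\beta!\, \Gamma(\frac{1}{2}p + \beta)} \int_0^{\infty} v^{\frac{1}{2}p + \beta - 2} e^{-\frac{1}{2}v}\, dv$

       $= \frac{1}{2^{\frac{1}{2}p}} e^{-\frac{1}{2}\tau^2} \sum_{\beta=0}^{\infty} \frac{(\tau^2/4)^\beta\, \Gamma(\frac{1}{2}p + \beta - 1)\, 2^{\frac{1}{2}p + \beta - 1}}{\beta!\, \Gamma(\frac{1}{2}p + \beta)}$

       $= \frac{1}{2} e^{-\frac{1}{2}\tau^2} \sum_{\beta=0}^{\infty} \frac{(\tau^2/2)^\beta}{\beta!\, (\frac{1}{2}p + \beta - 1)}$

for $p \ge 3$. Note that for $\mu = \nu$, that is, $\tau^2 = 0$, (11) is 1/(p - 2) and the mean
squared error (10) is 2. For large p the reduction in risk is considerable.

Table 3.2 gives values of the risk for p = 10 and $\sigma^2 = 1$. For example, if
$\tau^2 = \|\mu - \nu\|^2$ is 5, the mean squared error of the James-Stein estimator is
8.86, compared to 10 for the natural estimator; this is the case if $\mu_i - \nu_i =
1/\sqrt{2} = 0.707$, i = 1, ..., 10, for instance.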


Table 3.2. Average Mean Squared Error of the
James-Stein Estimator for p = 10 and $\sigma^2 = 1$


$\tau^2 = \|\mu - \nu\|^2$      $\mathcal{E}_\mu \|m(Y) - \mu\|^2$
0.0                             2.00
0.5                             4.78
1.0                             6.21
2.0                             7.51
3.0                             8.24
4.0                             8.62
5.0                             8.86
6.0                             9.03


From Efron and Morris (1977).
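The risk (10) can be evaluated from the last series in (11). The following is a minimal sketch; the function names and the truncation point `terms` are assumptions for illustration.

```python
import math

def expected_inverse_v(p, tau2, terms=500):
    """Series (11): E[1/V] for V noncentral chi-square(p, tau2), with p >= 3."""
    lam = 0.5 * tau2
    weight = math.exp(-lam)         # exp(-tau2/2) * (tau2/2)^beta / beta!
    total = 0.0
    for beta in range(terms):
        total += weight / (0.5 * p + beta - 1.0)
        weight *= lam / (beta + 1)
    return 0.5 * total

def james_stein_risk(p, tau2):
    """Risk (10): p - (p - 2)^2 * E[1/V]."""
    return p - (p - 2) ** 2 * expected_inverse_v(p, tau2)
```

For p = 10 the function returns 2 at $\tau^2 = 0$, in agreement with the first row of the table. Evaluated at $\tau^2 = 5$ it returns about 4.78, the value the table lists beside 0.5, and Jensen's inequality ($\mathcal{E}(1/V) \ge 1/\mathcal{E}V = 1/(p + \tau^2)$) caps the risk there at about 5.73. This suggests that the first column of the table may be scaled as $\|\mu - \nu\|^2/p$; that reading is an inference from the formulas and is not stated in the source.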
An obvious question in using an estimator of this class is how to choose
the vector $\nu$ toward which the observed mean vector is shrunk; any $\nu$ yields
an estimator better than the natural one. However, as seen from Table 3.2,
the improvement is small if $\|\mu - \nu\|$ is very large. Thus, to be effective some
knowledge of the position of $\mu$ is necessary. A disadvantage of the procedure
is that it is not objective; the choice of $\nu$ is up to the investigator.

A feature of the estimator we have been studying that seems disadvantageous
is that for small values of $\|Y - \nu\|$, the multiplier of $Y - \nu$ is negative;
that is, the estimator m(Y) is in the direction from $\nu$ opposite to that of Y.
This disadvantage can be overcome and the estimator improved by replacing
the factor by 0 when the factor is negative.


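Definition 3.5.1. For any function g(u), let

(12)   $g^+(u) = g(u), \qquad g(u) \ge 0,$
       $\phantom{g^+(u)} = 0, \qquad\quad\; g(u) < 0.$

Lemma 3.5.2. When X is distributed according to $N(\mu, I)$,

(13)   $\mathcal{E}_\mu \| g^+(\|X\|) X - \mu \|^2 \le \mathcal{E}_\mu \| g(\|X\|) X - \mu \|^2.$

Proof. The right-hand side of (13) minus the left-hand side is

(14)   $\mathcal{E}_\mu \{ [g(\|X\|)]^2 \|X\|^2 - [g^+(\|X\|)]^2 \|X\|^2 \} \ge 0$

plus 2 times

(15)   $\mathcal{E}_\mu\, \mu' X [g^+(\|X\|) - g(\|X\|)]$

       $= \|\mu\| \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_1 [g^+(\|y\|) - g(\|y\|)] \frac{1}{(2\pi)^{\frac{1}{2}p}} \exp\left\{ -\frac{1}{2} \left[ \sum_{i=1}^{p} y_i^2 - 2 y_1 \|\mu\| + \mu'\mu \right] \right\} dy,$

where y' = x'P, $(\|\mu\|, 0, \ldots, 0) = \mu'P$, and PP' = I. [The first column of P is
$(1/\|\mu\|)\mu$.] Then (15) is $\|\mu\|$ times

(16)   $e^{-\frac{1}{2}\mu'\mu} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \int_0^{\infty} y_1 [g^+(\|y\|) - g(\|y\|)] [e^{\|\mu\| y_1} - e^{-\|\mu\| y_1}]$
       $\cdot \frac{1}{(2\pi)^{\frac{1}{2}p}} e^{-\frac{1}{2} \sum_{i=1}^{p} y_i^2}\, dy_1\, dy_2 \cdots dy_p \ge 0$

(by replacing $y_1$ by $-y_1$ for $y_1 < 0$). ∎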
Theorem 3.5.2. The estimator

(17)   $m^+(y) = \left( 1 - \frac{p - 2}{\|y - \nu\|^2} \right)^+ (y - \nu) + \nu$

has smaller risk than m(y) defined by (2) and is minimax.


Proof. In Lemma 3.5.2, let $g(u) = 1 - (p - 2)/u^2$ and $X = Y - \nu$, and
replace $\mu$ by $\mu - \nu$. The second assertion in the theorem follows from
Theorem 3.4.6. ∎


The theorem shows that m(Y) is not admissible. However, it is known that
$m^+(Y)$ is also not admissible, but it is believed that not much further
improvement is possible.
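The positive-part estimator (17) differs from (2) only in truncating the multiplier at zero. A minimal sketch follows; the function name is an assumption, and returning $\nu$ when $y = \nu$ is a convention adopted here for the zero-distance case, which (17) leaves undefined.

```python
def james_stein_plus(y, nu):
    """Estimator (17): the multiplier of y - nu is replaced by 0 when negative."""
    p = len(y)
    dist2 = sum((yi - vi) ** 2 for yi, vi in zip(y, nu))
    if dist2 == 0.0:
        return list(nu)             # convention for y = nu (an assumption)
    factor = max(0.0, 1.0 - (p - 2) / dist2)
    return [factor * (yi - vi) + vi for yi, vi in zip(y, nu)]
```

For p = 4 and $\nu$ = 0, the observation y = (1, 0, 0, 0)' is mapped to $\nu$ itself, whereas y = (3, 0, 0, 0)' is multiplied by 7/9 exactly as under (2).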

This approach is easily extended to the case where one observes $x_1, \ldots, x_N$
from $N(\mu, \Sigma)$ with loss function $L(\mu, m) = (m - \mu)' \Sigma^{-1} (m - \mu)$. Let $\Sigma =
CC'$ for some nonsingular C, $x_\alpha = C x_\alpha^*$, $\alpha = 1, \ldots, N$, $\mu = C\mu^*$, and
$L^*(m^*, \mu^*) = \|m^* - \mu^*\|^2$. Then $x_1^*, \ldots, x_N^*$ are observations from $N(\mu^*, I)$,
and the problem is reduced to the earlier one. Then

(18)   $\left( 1 - \frac{p - 2}{N (\bar{x} - \nu)' \Sigma^{-1} (\bar{x} - \nu)} \right)^+ (\bar{x} - \nu) + \nu$

is a minimax estimator of $\mu$.

3.5.3. Estimation for a General Known Covariance Matrix and an
Arbitrary Quadratic Loss Function


Let the parent distribution be $N(\mu, \Sigma)$, where $\Sigma$ is known, and let the loss
function be

(19)   $L(\mu, m) = (m - \mu)' Q (m - \mu),$

where Q is an arbitrary positive definite matrix which reflects the relative
importance of errors in different directions. (If the loss function were
singular, the dimensionality of x could be reduced so as to make the loss
matrix nonsingular.) Then the sample mean $\bar{x}$ has the distribution
$N(\mu, (1/N)\Sigma)$ and risk (expected loss)

(20)   $\mathcal{E}(\bar{x} - \mu)' Q (\bar{x} - \mu) = \mathcal{E}\, \mathrm{tr}\, Q (\bar{x} - \mu)(\bar{x} - \mu)' = \frac{1}{N} \mathrm{tr}\, Q\Sigma,$

which is constant, not depending on $\mu$.
Several estimators that improve on $\bar{x}$ have been proposed. First we take
up an estimator proposed independently by Berger (1975) and Hudson
(1974).


Theorem 3.5.3. Let r(z), $0 \le z < \infty$, be a nondecreasing differentiable function
such that $0 \le r(z) \le 2(p - 2)$. Then for $p \ge 3$

(21)   $m(\bar{x}) = \left[ I - \frac{r\left( N^2 (\bar{x} - \nu)' \Sigma^{-1} Q^{-1} \Sigma^{-1} (\bar{x} - \nu) \right)}{N (\bar{x} - \nu)' \Sigma^{-1} Q^{-1} \Sigma^{-1} (\bar{x} - \nu)}\, Q^{-1} \Sigma^{-1} \right] (\bar{x} - \nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax.


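Proof. There exists a matrix C such that C'QC = I and $(1/N)\Sigma = C\Delta C'$,
where $\Delta$ is diagonal with diagonal elements $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_p > 0$ (Theorem
A.2.2 of the Appendix). Let $\bar{x} = Cy + \nu$ and $\mu = C\mu^* + \nu$. Then y has the
distribution $N(\mu^*, \Delta)$, and the transformed loss function is

(22)   $L^*(m^*, \mu^*) = (m^* - \mu^*)'(m^* - \mu^*) = \|m^* - \mu^*\|^2.$

The estimator (21) of $\mu$ is transformed to the estimator of $\mu^* = C^{-1}(\mu - \nu)$,

(23)   $m^*(y) = \left[ I - \frac{r(y' \Delta^{-2} y)}{y' \Delta^{-2} y}\, \Delta^{-1} \right] y.$

We now proceed as in the proof of Theorem 3.5.1. The difference in risks
between y and $m^*$ is

(24)   $\Delta R(\mu^*) = \mathcal{E}_{\mu^*} \{ \|Y - \mu^*\|^2 - \|m^*(Y) - \mu^*\|^2 \}$

       $= \mathcal{E}_{\mu^*} \left\{ 2 \frac{r(Y' \Delta^{-2} Y)}{Y' \Delta^{-2} Y} \sum_{i=1}^{p} \frac{1}{\delta_i} Y_i (Y_i - \mu_i^*) - \frac{r^2(Y' \Delta^{-2} Y)}{Y' \Delta^{-2} Y} \right\}.$

Since r(z) is differentiable, we use Lemma 3.5.1 with $(x - \theta) = (y_i - \mu_i^*)\, \delta_i^{-1/2}$
and

(25)   $f(y_i) = \frac{r(y' \Delta^{-2} y)}{y' \Delta^{-2} y}\, y_i,$

(26)   $f'(y_i) = \frac{r(y' \Delta^{-2} y)}{y' \Delta^{-2} y} + \frac{2 r'(y' \Delta^{-2} y)}{y' \Delta^{-2} y} \frac{y_i^2}{\delta_i^2} - \frac{2 r(y' \Delta^{-2} y)}{(y' \Delta^{-2} y)^2} \frac{y_i^2}{\delta_i^2}.$

Then

(27)   $\Delta R(\mu^*) = \mathcal{E}_{\mu^*} \left\{ 2(p - 2) \frac{r(Y' \Delta^{-2} Y)}{Y' \Delta^{-2} Y} + 4 r'(Y' \Delta^{-2} Y) - \frac{r^2(Y' \Delta^{-2} Y)}{Y' \Delta^{-2} Y} \right\} \ge 0,$

since $r(y' \Delta^{-2} y) \le 2(p - 2)$ and $r'(y' \Delta^{-2} y) \ge 0$. ∎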
Corollary 3.5.1. For $p \ge 3$

(28)   $\left[ I - \frac{\min\left[ p - 2,\; N^2 (\bar{x} - \nu)' \Sigma^{-1} Q^{-1} \Sigma^{-1} (\bar{x} - \nu) \right]}{N (\bar{x} - \nu)' \Sigma^{-1} Q^{-1} \Sigma^{-1} (\bar{x} - \nu)}\, Q^{-1} \Sigma^{-1} \right] (\bar{x} - \nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax.


Proof. The function r(z) = min(p - 2, z) is differentiable except at z =
p - 2. The function r(z) can be approximated arbitrarily closely by a
differentiable function. (For example, the corner at z = p - 2 can be smoothed by
a circular arc of arbitrarily small radius.) We shall not give the details of the
proof. ∎


In canonical form y is shrunk by a scalar times a diagonal matrix. The 
larger the variance of a component is, the less the effect of the shrinkage. 
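The canonical-form estimator (23) with the choice r(z) = min(p - 2, z) from Corollary 3.5.1 can be sketched as follows. The list `delta` holds the diagonal elements $\delta_i$ of $\Delta$; the function name and the handling of y = 0 are assumptions for illustration.

```python
def canonical_shrink(y, delta):
    """Estimator (23) with r(z) = min(p - 2, z); delta holds the variances."""
    p = len(y)
    z = sum((yi / di) ** 2 for yi, di in zip(y, delta))    # y' Delta^{-2} y
    if z == 0.0:
        return list(y)              # nothing to shrink when y = 0 (assumption)
    ratio = min(p - 2, z) / z
    # component i is multiplied by 1 - ratio / delta_i
    return [(1.0 - ratio / di) * yi for yi, di in zip(y, delta)]
```

With y = (2, 2, 2)' and variances (1, 2, 4), the multipliers are about 0.81, 0.90, and 0.95, so the component with the largest variance is shrunk least.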
Berger (1975) has proved these results for a more general density, that is, 


for a mixture of normals. Berger (1976) has also proved in the case of 
normality that if 


(29) (a) - = 


for $3 - \frac{1}{2}p \le c < 1 + \frac{1}{2}p$, where $\alpha$ is the smallest characteristic root of $\Sigma Q$,
then the estimator m given by (21) is minimax, is admissible if c < 2, and is
proper Bayes if c < 1.

Another approach to minimax estimators has been introduced by Bhattacharya
(1966). Let C be such that $C^{-1} (1/N)\Sigma (C^{-1})' = I$ and $C'QC = Q^*$,
which is diagonal with diagonal elements $q_1^* \ge q_2^* \ge \cdots \ge q_p^* > 0$. Then $y =
C^{-1}\bar{x}$ has the distribution $N(\mu^*, I)$, and the loss function is

(30)   $L^*(m^*, \mu^*) = \sum_{i=1}^{p} q_i^* (m_i^* - \mu_i^*)^2$

       $= \sum_{i=1}^{p} \sum_{j=i}^{p} \alpha_j (m_i^* - \mu_i^*)^2$

       $= \sum_{j=1}^{p} \alpha_j \sum_{i=1}^{j} (m_i^* - \mu_i^*)^2$

       $= \sum_{j=1}^{p} \alpha_j \|m^{*(j)} - \mu^{*(j)}\|^2,$

where $\alpha_j = q_j^* - q_{j+1}^*$, j = 1, ..., p - 1, $\alpha_p = q_p^*$, $m^{*(j)} = (m_1^*, \ldots, m_j^*)'$, and
$\mu^{*(j)} = (\mu_1^*, \ldots, \mu_j^*)'$, j = 1, ..., p. This decomposition of the loss function
suggests combining minimax estimators of the vectors $\mu^{*(j)}$, j = 1, ..., p. Let
$y^{(j)} = (y_1, \ldots, y_j)'$.
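Theorem 3.5.4. If $h^{(j)}(y^{(j)}) = [h_1^{(j)}(y^{(j)}), \ldots, h_j^{(j)}(y^{(j)})]'$ is a minimax
estimator of $\mu^{*(j)}$ under the loss function $\|m^{*(j)} - \mu^{*(j)}\|^2$, j = 1, ..., p, then

(31)   $\frac{1}{q_i^*} \sum_{j=i}^{p} \alpha_j h_i^{(j)}(y^{(j)}), \qquad i = 1, \ldots, p,$

is a minimax estimator of $\mu_1^*, \ldots, \mu_p^*$.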

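Proof. First consider the randomized estimator defined by

(32)   $\Pr\{ G_i(Y) = h_i^{(j)}(Y^{(j)}) \} = \frac{\alpha_j}{q_i^*}, \qquad j = i, \ldots, p,$

for the ith component. Then the risk of this estimator is

(33)   $\sum_{i=1}^{p} q_i^*\, \mathcal{E}_{\mu^*} [G_i(Y) - \mu_i^*]^2 = \sum_{i=1}^{p} q_i^* \sum_{j=i}^{p} \frac{\alpha_j}{q_i^*}\, \mathcal{E}_{\mu^*} [h_i^{(j)}(Y^{(j)}) - \mu_i^*]^2$

       $= \sum_{j=1}^{p} \alpha_j \sum_{i=1}^{j} \mathcal{E}_{\mu^*} [h_i^{(j)}(Y^{(j)}) - \mu_i^*]^2$

       $= \sum_{j=1}^{p} \alpha_j\, \mathcal{E}_{\mu^*} \|h^{(j)}(Y^{(j)}) - \mu^{*(j)}\|^2$

       $\le \sum_{j=1}^{p} \alpha_j\, j = \sum_{j=1}^{p} q_j^*$

       $= \mathcal{E}_{\mu^*} L^*(Y, \mu^*),$

and hence the estimator defined by (32) is minimax.
Since the expected value of G(Y) with respect to (32) is (31) and the loss
function is convex, the risk of the estimator (31) is less than that of the
randomized estimator (by Jensen's inequality). ∎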


3.6. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


3.6.1. Observations Elliptically Contoured 


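Let $x_1, \ldots, x_N$ be N (= n + 1) independent observations on a random vector
X with density $|\Lambda|^{-\frac{1}{2}} g[(x - \nu)' \Lambda^{-1} (x - \nu)]$. The density of the sample is

(1)   $|\Lambda|^{-\frac{1}{2}N} \prod_{\alpha=1}^{N} g[(x_\alpha - \nu)' \Lambda^{-1} (x_\alpha - \nu)].$

The sample mean $\bar{x}$ and covariance matrix $S = (1/n)[\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)'
- N(\bar{x} - \mu)(\bar{x} - \mu)']$ are unbiased estimators of the mean $\mu = \nu$ and the
covariance matrix $\Sigma = [\mathcal{E}R^2/p]\Lambda$, where $R^2 = (x - \nu)' \Lambda^{-1} (x - \nu)$.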


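Theorem 3.6.1. The covariances of the mean and covariance of a sample of
N from $|\Lambda|^{-\frac{1}{2}} g[(x - \nu)' \Lambda^{-1} (x - \nu)]$ with $\mathcal{E}R^4 < \infty$ are

(2)   $\mathcal{E}(\bar{x} - \mu)(\bar{x} - \mu)' = \frac{1}{N}\Sigma,$

(3)   $\mathcal{E}(s_{ij} - \sigma_{ij})(\bar{x} - \mu) = 0, \qquad i, j = 1, \ldots, p,$

(4)   $\mathcal{E}(s_{ij} - \sigma_{ij})(s_{kl} - \sigma_{kl}) = \frac{\kappa}{N} (\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$
      $+ \frac{1}{n} (\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}), \qquad i, j, k, l = 1, \ldots, p.$

Lemma 3.6.1. The second-order moments of the elements of S are

(5)   $\mathcal{E}\, s_{ij} s_{kl} = \sigma_{ij}\sigma_{kl} + \frac{1}{n} (\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + \frac{\kappa}{N} (\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}),$
      $i, j, k, l = 1, \ldots, p.$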
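Proof of Lemma 3.6.1. We have

(6)   $\mathcal{E} \sum_{\alpha, \beta = 1}^{N} (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)(x_{k\beta} - \mu_k)(x_{l\beta} - \mu_l)$

      $= N \mathcal{E}(x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)(x_{k\alpha} - \mu_k)(x_{l\alpha} - \mu_l)$
      $+ N(N - 1)\, \mathcal{E}(x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)\, \mathcal{E}(x_{k\beta} - \mu_k)(x_{l\beta} - \mu_l)$

      $= N(1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + N(N - 1)\sigma_{ij}\sigma_{kl},$

(7)   $\mathcal{E} N^2 (\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)(\bar{x}_k - \mu_k)(\bar{x}_l - \mu_l)$

      $= \frac{1}{N^2}\, \mathcal{E} \sum_{\alpha, \beta, \gamma, \delta = 1}^{N} (x_{i\alpha} - \mu_i)(x_{j\beta} - \mu_j)(x_{k\gamma} - \mu_k)(x_{l\delta} - \mu_l)$

      $= \frac{1}{N}(1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$
      $+ \frac{N - 1}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}),$

(8)   $\mathcal{E} \sum_{\alpha=1}^{N} (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)\, \frac{1}{N} \sum_{\beta, \gamma = 1}^{N} (x_{k\beta} - \mu_k)(x_{l\gamma} - \mu_l)$

      $= (1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + (N - 1)\sigma_{ij}\sigma_{kl}.$ ∎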
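It will be convenient to use more matrix algebra. Define vec B, $B \otimes C$ (the
Kronecker product), and $K_{mn}$ (the commutator matrix) by

(9)  $\operatorname{vec} B = \operatorname{vec}(b_1, \ldots, b_n) = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$,

(10)  $B \otimes C = \begin{pmatrix} b_{11}C & \cdots & b_{1n}C \\ \vdots & & \vdots \\ b_{m1}C & \cdots & b_{mn}C \end{pmatrix}$,

(11)  $K_{mn}\operatorname{vec} B = \operatorname{vec} B'$.

See, e.g., Magnus and Neudecker (1979) or Section A.5 of the Appendix. We
can rewrite (4) as

(12)  $\mathcal{C}(\operatorname{vec} S) = \mathcal{E}(\operatorname{vec} S - \operatorname{vec}\Sigma)(\operatorname{vec} S - \operatorname{vec}\Sigma)'$
      $= \frac{1}{n}(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \frac{\kappa}{N}[(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \operatorname{vec}\Sigma(\operatorname{vec}\Sigma)']$.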
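
The operators in (9)-(11) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names `vec` and `commutation_matrix` are illustrative choices, not notation from the text. It builds $K_{mn}$ explicitly and confirms (11), together with the identity vec FGH = (H' ⊗ F) vec G that is used in Section 3.6.4.

```python
import numpy as np

def vec(B):
    """Stack the columns of B into one long vector, as in (9)."""
    return B.reshape(-1, order="F")

def commutation_matrix(m, n):
    """Return K_mn such that K_mn vec(B) = vec(B') for any m x n matrix B."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # b_ij sits at position j*m + i in vec(B) and at i*n + j in vec(B').
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 3))
K = commutation_matrix(2, 3)
F, G, H = (rng.standard_normal(s) for s in [(2, 3), (3, 4), (4, 2)])

print(np.allclose(K @ vec(B), vec(B.T)))                      # checks (11)
print(np.allclose(vec(F @ G @ H), np.kron(H.T, F) @ vec(G)))  # vec FGH identity
```

Column-major stacking (`order="F"`) is what makes `vec` agree with (9); NumPy's default row-major flattening would give vec B' instead.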


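Theorem 3.6.2

(13)  $\sqrt{N}\begin{pmatrix} \bar{x} - \mu \\ \operatorname{vec} S - \operatorname{vec}\Sigma \end{pmatrix} \xrightarrow{d} N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma & 0 \\ 0 & (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\operatorname{vec}\Sigma(\operatorname{vec}\Sigma)' \end{pmatrix}\right]$.

This theorem follows from the central limit theorem for independent
identically distributed random vectors (with finite fourth moments). The
theorem forms the basis for large-sample inference.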


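3.6.2. Estimation of the Kurtosis Parameter


To apply the large-sample distribution theory derived for normal distributions
to problems of inference for elliptically contoured distributions it is
necessary to know or estimate the kurtosis parameter $\kappa$. Note that

(14)  $\mathcal{E}\{[(x - \mu)'\Sigma^{-1}(x - \mu)]^2\} = \frac{p^2\,\mathcal{E}R^4}{(\mathcal{E}R^2)^2} = p(p + 2)(1 + \kappa)$.

Since $\bar{x} \xrightarrow{p} \mu$ and $S \xrightarrow{p} \Sigma$,

(15)  $\frac{1}{N}\sum_{\alpha=1}^{N}[(x_\alpha - \bar{x})'S^{-1}(x_\alpha - \bar{x})]^2 \xrightarrow{p} p(p + 2)(1 + \kappa)$.

A consistent estimator of $\kappa$ is

(16)  $\hat{\kappa} = \frac{1}{p(p + 2)}\,\frac{1}{N}\sum_{\alpha=1}^{N}[(x_\alpha - \bar{x})'S^{-1}(x_\alpha - \bar{x})]^2 - 1$.

Mardia (1970) proposed using M to form a consistent estimator of $\kappa$.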
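
As an illustration of (16), here is a minimal sketch assuming NumPy and an N x p data matrix whose rows are the observations; the function name `kurtosis_estimate` is hypothetical. It uses S with divisor n = N - 1, as defined in Section 3.6.1.

```python
import numpy as np

def kurtosis_estimate(X):
    """Estimate kappa by (16) from an N x p data matrix X (rows are observations)."""
    N, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / (N - 1)      # sample covariance, divisor n = N - 1
    # Quadratic forms (x_a - xbar)' S^{-1} (x_a - xbar), one per observation.
    d2 = np.einsum("ai,ai->a", centered @ np.linalg.inv(S), centered)
    return np.mean(d2 ** 2) / (p * (p + 2)) - 1.0

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
print(kurtosis_estimate(X))   # should be near 0 for normal data when N is large
```

Because the quadratic forms are unchanged by nonsingular affine transformations of the data, the estimate does not depend on the location or scale of the observations.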


3.6.3. Maximum Likelihood Estimation


We have considered using S as an estimator of $\Sigma = (\mathcal{E}R^2/p)\Lambda$. When the
parent distribution is normal, S is the sufficient statistic invariant with
respect to translations and hence is the efficient unbiased estimator. Now we
study other estimators.

We consider first the maximum likelihood estimators of $\mu$ and $\Lambda$ when
the form of the density $g(\cdot)$ is known. The logarithm of the likelihood
function is

(17)  $\log L = -\frac{N}{2}\log|\Lambda| + \sum_{\alpha=1}^{N}\log g[(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)]$.

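The derivatives of log L with respect to the components of $\mu$ are

(18)  $\frac{\partial\log L}{\partial\mu} = -2\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)]}{g[(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)]}\,\Lambda^{-1}(x_\alpha - \mu)$.

Setting the vector of derivatives equal to 0 leads to the equation

(19)  $\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}{g[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}\,x_\alpha = \hat{\mu}\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}{g[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}$.

Setting equal to 0 the derivatives of log L with respect to the elements of
$\Lambda^{-1}$ gives

(20)  $\hat{\Lambda} = -\frac{2}{N}\sum_{\alpha=1}^{N}\frac{g'[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}{g[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}\,(x_\alpha - \hat{\mu})(x_\alpha - \hat{\mu})'$.

The estimator $\hat{\Lambda}$ is a kind of weighted average of the rank 1 matrices
$(x_\alpha - \hat{\mu})(x_\alpha - \hat{\mu})'$. In the normal case the weights are 1/N. In most cases
(19) and (20) cannot be solved explicitly, but the solution may be approximated
by iterative methods.

The covariance matrix of the limiting normal distribution of
$\sqrt{N}(\operatorname{vec}\hat{\Lambda} - \operatorname{vec}\Lambda)$ is

(21)  $\mathcal{C}(\operatorname{vec}\hat{\Lambda}) = \sigma_{1g}(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \sigma_{2g}\operatorname{vec}\Lambda(\operatorname{vec}\Lambda)'$,

where

(22)  $\sigma_{1g} = \frac{p(p + 2)}{4\,\mathcal{E}\left\{\left[\frac{g'(R^2)}{g(R^2)}\right]^2 R^4\right\}}$,

(23)  $\sigma_{2g} = -\frac{2\sigma_{1g}(1 - \sigma_{1g})}{2 + p(1 - \sigma_{1g})}$.
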

See Tyler (1982). 
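
The remark that (19) and (20) may be solved iteratively can be sketched as a fixed-point iteration. The example below is an illustration only, under the assumption that g is the multivariate t generator, $g(w) \propto (1 + w/\nu)^{-(\nu + p)/2}$, for which $-2g'(w)/g(w) = (\nu + p)/(\nu + w)$; the function name `t_mle`, the starting values, and the fixed iteration count are choices made here, not prescriptions from the text.

```python
import numpy as np

def t_mle(X, nu, iterations=200):
    """Iterate (19) and (20) for the multivariate t generator with nu degrees of freedom."""
    N, p = X.shape
    mu = X.mean(axis=0)                            # start from the normal-theory values
    Lam = np.cov(X, rowvar=False, bias=True)
    for _ in range(iterations):
        diff = X - mu
        w = np.einsum("ai,ai->a", diff @ np.linalg.inv(Lam), diff)
        u = (nu + p) / (nu + w)                    # u_a = -2 g'(w_a) / g(w_a)
        mu = (u[:, None] * X).sum(axis=0) / u.sum()   # weighted mean, from (19)
        diff = X - mu
        Lam = (u[:, None] * diff).T @ diff / N        # weighted rank 1 average, from (20)
    return mu, Lam

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2)) @ np.array([[1.0, 0.5], [0.0, 2.0]])
mu_hat, Lam_hat = t_mle(X, nu=5.0)
print(mu_hat)
print(Lam_hat)
```

As $\nu \to \infty$ every weight tends to 1, and the iteration returns the sample mean and $(1/N)A$, in agreement with the statement that the weights are 1/N in the normal case.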


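3.6.4. Elliptically Contoured Matrix Distributions

Let

(24)  $Y = \begin{pmatrix} y_1' \\ \vdots \\ y_N' \end{pmatrix}$

be an $N \times p$ random matrix with density $g(Y'Y) = g(\sum_{\alpha=1}^{N} y_\alpha y_\alpha')$. Note that
the density $g(Y'Y)$ is invariant with respect to orthogonal transformations
$Y^* = O_N Y$. Such densities are known as left spherical matrix densities. An
example is the density of N observations from $N(0, I_p)$,

(25)  $(2\pi)^{-Np/2} e^{-\frac{1}{2}\operatorname{tr} Y'Y}$.

In this example Y is also right spherical: $YO_p \overset{d}{=} Y$. When Y is both left
spherical and right spherical, it is known as spherical. Further, if Y has the
density (25), vec Y is spherical; in general if Y has a density, the density is of
the form

(26)  $g(\operatorname{tr} Y'Y) = g\left(\sum_{\alpha=1}^{N}\sum_{i=1}^{p} y_{i\alpha}^2\right) = g(\operatorname{tr} YY') = g[(\operatorname{vec} Y)'\operatorname{vec} Y] = g[(\operatorname{vec} Y')'\operatorname{vec} Y']$.

We call this model vector-spherical. Define

(27)  $X = YC' + \varepsilon_N\mu'$,

where $C'\Lambda^{-1}C = I_p$ and $\varepsilon_N' = (1, \ldots, 1)$. Since (27) is equivalent to
$Y = (X - \varepsilon_N\mu')(C')^{-1}$ and $(C')^{-1}C^{-1} = \Lambda^{-1}$, the matrix X has the density

(28)  $|\Lambda|^{-N/2} g[\operatorname{tr}(X - \varepsilon_N\mu')\Lambda^{-1}(X - \varepsilon_N\mu')'] = |\Lambda|^{-N/2} g\left[\sum_{\alpha=1}^{N}(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)\right]$.

From (26) we deduce that vec Y has the representation

(29)  $\operatorname{vec} Y \overset{d}{=} R\operatorname{vec} U$,

where $w = R^2$ has the density

(30)  $\frac{\pi^{Np/2}}{\Gamma(Np/2)}\,w^{\frac{1}{2}Np - 1} g(w)$,

vec U has the uniform distribution on $\sum_{\alpha=1}^{N}\sum_{i=1}^{p} u_{i\alpha}^2 = 1$, and R and vec U are
independent. The covariance matrix of vec Y is

(31)  $\mathcal{E}\operatorname{vec} Y(\operatorname{vec} Y)' = \frac{\mathcal{E}R^2}{Np} I_{Np} = \frac{\mathcal{E}R^2}{Np}\,I_p \otimes I_N$.

Since vec FGH = $(H' \otimes F)$ vec G for any conformable matrices F, G, and H,
we can write (27) as

(32)  $\operatorname{vec} X = (C \otimes I_N)\operatorname{vec} Y + \mu \otimes \varepsilon_N$.

Thus

(33)  $\mathcal{E}\operatorname{vec} X = \mu \otimes \varepsilon_N$,

(34)  $\mathcal{C}(\operatorname{vec} X) = (C \otimes I_N)\,\mathcal{C}(\operatorname{vec} Y)\,(C' \otimes I_N) = \frac{\mathcal{E}R^2}{Np}\,\Lambda \otimes I_N$,

(35)  $\mathcal{E}(\text{row of } X) = \mu'$,

(36)  $\mathcal{C}(\text{row of } X') = \frac{\mathcal{E}R^2}{Np}\,\Lambda$.

The rows of X are uncorrelated (though not necessarily independent). From
(32) we obtain

(37)  $\operatorname{vec} X \overset{d}{=} R(C \otimes I_N)\operatorname{vec} U + \mu \otimes \varepsilon_N$,

(38)  $X \overset{d}{=} RUC' + \varepsilon_N\mu'$.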
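
The representation (38) suggests a direct way to simulate X. The sketch below assumes NumPy; the function name `sample_elliptical_matrix`, the use of a Cholesky factor for C (so that $CC' = \Lambda$ and hence $C'\Lambda^{-1}C = I_p$), and the user-supplied `radius_sampler` are all assumptions made for illustration. The matrix U is obtained by normalizing a standard normal matrix, which gives vec U the uniform distribution on the unit sphere in Np dimensions.

```python
import numpy as np

def sample_elliptical_matrix(N, Lam, mu, radius_sampler, rng):
    """Draw X = R U C' + eps_N mu' as in (38); returns X together with R and U."""
    p = len(mu)
    C = np.linalg.cholesky(Lam)            # C C' = Lam, so C' Lam^{-1} C = I_p
    Z = rng.standard_normal((N, p))
    U = Z / np.linalg.norm(Z)              # Frobenius norm 1: vec U is uniform on the sphere
    R = radius_sampler(rng)
    X = (R * U) @ C.T + np.ones((N, 1)) * mu
    return X, R, U

rng = np.random.default_rng(2)
N, mu = 6, np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.6], [0.6, 1.0]])
# Taking R^2 to be chi-square with Np degrees of freedom reproduces the normal case.
X, R, U = sample_elliptical_matrix(N, Lam, mu, lambda g: np.sqrt(g.chisquare(N * len(mu))), rng)
print(X.shape, np.linalg.norm(U))
```

Other choices of `radius_sampler` give other members of the family with the same $\mu$ and $\Lambda$.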


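Since $X - \varepsilon_N\mu' = (X - \varepsilon_N\bar{x}') + \varepsilon_N(\bar{x} - \mu)'$, and $\varepsilon_N'(X - \varepsilon_N\bar{x}') = 0$, we can
write the density of X as

(39)  $|\Lambda|^{-N/2} g[\operatorname{tr}\Lambda^{-1}(X - \varepsilon_N\bar{x}')'(X - \varepsilon_N\bar{x}') + N(\bar{x} - \mu)'\Lambda^{-1}(\bar{x} - \mu)]$,

where $\bar{x} = (1/N)X'\varepsilon_N$. This shows that a sufficient set of statistics for $\mu$ and
$\Lambda$ is $\bar{x}$ and $nS = (X - \varepsilon_N\bar{x}')'(X - \varepsilon_N\bar{x}')$, as for the normal distribution. The
maximum likelihood estimators can be derived from the following theorem,
which will be used later for other models.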


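Theorem 3.6.3. Suppose the m-component vector Z has the density
$|\Phi|^{-1/2} h[(z - \nu)'\Phi^{-1}(z - \nu)]$, where $w^{\frac{1}{2}m}h(w)$ has a finite positive maximum at
$w_h$ and $\Phi$ is a positive definite matrix. Let $\Omega$ be a set in the space of $(\nu, \Phi)$
such that if $(\nu, \Phi) \in \Omega$ then $(\nu, c\Phi) \in \Omega$ for all $c > 0$. Suppose that on the
basis of an observation z when $h(w) = \text{const}\,e^{-\frac{1}{2}w}$ (i.e., Z has a normal
distribution) the maximum likelihood estimator $(\tilde{\nu}, \tilde{\Phi}) \in \Omega$ exists and is unique
with $\tilde{\Phi}$ positive definite with probability 1. Then the maximum likelihood
estimator of $(\nu, \Phi)$ for arbitrary $h(\cdot)$ is

(40)  $\hat{\nu} = \tilde{\nu}, \qquad \hat{\Phi} = \frac{m}{w_h}\tilde{\Phi}$,

and the maximum of the likelihood is $|\hat{\Phi}|^{-1/2}h(w_h)$ [Anderson, Fang, and Hsu
(1986)].


Proof. Let $\Psi = |\Phi|^{-1/m}\Phi$ and

(41)  $d = (z - \nu)'\Phi^{-1}(z - \nu) = \frac{(z - \nu)'\Psi^{-1}(z - \nu)}{|\Phi|^{1/m}}$.

Then $(\nu, \Psi) \in \Omega$ and $|\Psi| = 1$. The likelihood is

(42)  $[(z - \nu)'\Psi^{-1}(z - \nu)]^{-\frac{1}{2}m}\,d^{\frac{1}{2}m}h(d)$.

Under normality $h(d) = (2\pi)^{-\frac{1}{2}m}e^{-\frac{1}{2}d}$, and the maximum of (42) is attained
at $\nu = \tilde{\nu}$, $\Psi = \tilde{\Psi} = |\tilde{\Phi}|^{-1/m}\tilde{\Phi}$, and $d = m$. For arbitrary $h(\cdot)$ the maximum
of (42) is attained at $\hat{\nu} = \tilde{\nu}$, $\hat{\Psi} = \tilde{\Psi}$, and $\hat{d} = w_h$. Then the maximum likelihood
estimator of $\Phi$ is

(43)  $\hat{\Phi} = |\hat{\Phi}|^{1/m}\hat{\Psi} = \frac{(z - \tilde{\nu})'\tilde{\Psi}^{-1}(z - \tilde{\nu})}{w_h}\,\tilde{\Psi}$.

Then (40) follows from (43) by use of (41). ∎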


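Theorem 3.6.4. Let X $(N \times p)$ have the density (28), where $w^{\frac{1}{2}Np}g(w)$ has
a finite positive maximum at $w_g$. Then the maximum likelihood estimators of $\mu$
and $\Lambda$ are

(44)  $\hat{\mu} = \bar{x}, \qquad \hat{\Lambda} = \frac{p}{w_g}A$,

where $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.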
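
To make the role of $w_g$ in (44) concrete, the following sketch locates the maximizer of $w^{\frac{1}{2}Np}g(w)$ numerically. It assumes only the Python standard library; the function name `argmax_w`, the ternary search (which presumes the objective is unimodal in log w), and the second generator $g(w) \propto e^{-w^2/2}$ are hypothetical choices for illustration.

```python
import math

def argmax_w(log_g, m, lo=1e-8, hi=1e8, iterations=300):
    """Locate w maximizing w^(m/2) g(w), assuming the objective is unimodal in log w."""
    def objective(t):
        return 0.5 * m * t + log_g(math.exp(t))
    a, b = math.log(lo), math.log(hi)
    for _ in range(iterations):
        t1 = a + (b - a) / 3
        t2 = b - (b - a) / 3
        if objective(t1) < objective(t2):
            a = t1
        else:
            b = t2
    return math.exp(0.5 * (a + b))

N, p = 5, 2
m = N * p
w_normal = argmax_w(lambda w: -0.5 * w, m)       # normal case: g(w) proportional to exp(-w/2)
w_power = argmax_w(lambda w: -0.5 * w ** 2, m)   # hypothetical g(w) proportional to exp(-w^2/2)
print(w_normal, p / w_normal)                    # multiplier of A in (44)
print(w_power, p / w_power)
```

In the normal case the maximizer is $w_g = Np$, so the multiplier $p/w_g$ reduces to $1/N$ and (44) gives the familiar estimator $(1/N)A$.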


Corollary 3.6.1. Let X $(N \times p)$ have the density (28). Then the maximum
likelihood estimators of $\nu$, $(\lambda_{11}, \ldots, \lambda_{pp})$, and $\rho_{ij}$, $i, j = 1, \ldots, p$, are $\bar{x}$,
$(p/w_g)(a_{11}, \ldots, a_{pp})$, and $a_{ij}/\sqrt{a_{ii}a_{jj}}$, $i, j = 1, \ldots, p$.


Proof. Corollary 3.6.1 follows from Theorem 3.6.3 and Corollary 3.2.1. ∎


Theorem 3.6.5. Let f(X) be a vector-valued function of X $(N \times p)$ such
that

(45)  $f(X + \varepsilon_N\nu') = f(X)$

for all $\nu$ and

(46)  $f(cX) = f(X)$

for all c. Then the distribution of f(X) where X has an arbitrary density (28) is
the same as its distribution where X has the normal density (28).


Proof. Substitution of the representation (27) into f(X) gives

(47)  $f(X) = f(YC' + \varepsilon_N\mu') = f(YC')$

by (45). Let f(X) = h(vec X). Then by (46), h(c vec X) = h(vec X) and

(48)  $f(YC') = h[(C \otimes I_N)\operatorname{vec} Y] = h[R(C \otimes I_N)\operatorname{vec} U] = h[(C \otimes I_N)\operatorname{vec} U]$. ∎


Any statistic satisfying (45) and (46) has the same distribution for all $g(\cdot)$.
Hence, if its distribution is known for the normal case, the distribution is
valid for all elliptically contoured distributions.
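
The matrix of sample correlations is a standard example of a statistic satisfying both (45) and (46). The check below is a small sketch assuming NumPy; the function name `sample_correlations` is an illustrative choice.

```python
import numpy as np

def sample_correlations(X):
    """Matrix of r_ij = a_ij / sqrt(a_ii a_jj) computed from an N x p data matrix."""
    centered = X - X.mean(axis=0)
    A = centered.T @ centered
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 3))
nu = np.array([10.0, -3.0, 0.5])

print(np.allclose(sample_correlations(X + np.ones((8, 1)) * nu), sample_correlations(X)))  # (45)
print(np.allclose(sample_correlations(2.5 * X), sample_correlations(X)))                   # (46)
```

By Theorem 3.6.5, the null distribution of any such statistic derived under normality carries over to every density of the form (28).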

Any function of the sufficient set of statistics that is translation-invariant,
that is, that satisfies (45), is a function of S. Thus inference concerning $\Sigma$ can
be based on S.


Corollary 3.6.2. Let f(X) be a vector-valued function of X $(N \times p)$ such
that (46) holds for all c. Then the distribution of f(X) where X has arbitrary
density (28) with $\mu = 0$ is the same as its distribution where X has normal density
(28) with $\mu = 0$.


Fang and Zhang (1990) give this corollary as Theorem 2.5.8. 


PROBLEMS 


3.1. (Sec. 3.2) Find $\hat{\mu}$, $\hat{\Sigma}$, and $(\hat{\rho}_{ij})$ for the data given in Table 3.3, taken from
Frets (1921).


3.2. (Sec. 3.2) Verify the numerical results of (21).


3.3. (Sec. 3.2) Compute $\hat{\mu}$, $\hat{\Sigma}$, S, and $\hat{\rho}$ for the following pairs of observations:
(34, 55), (12, 29), (33, 75), (44, 89), (89, 62), (59, 69), (50, 41), (88, 67). Plot the
observations.


3.4. (Sec. 3.2) Use the facts that $|C^*| = \prod\lambda_i$, $\operatorname{tr} C^* = \sum\lambda_i$, and $C^* = I$ if
$\lambda_1 = \cdots = \lambda_p = 1$, where $\lambda_1, \ldots, \lambda_p$ are the characteristic roots of $C^*$, to prove Lemma
3.2.2. [Hint: Use f as given in (12).]
Table 3.3*. Head Lengths and Breadths of Brothers


Head Length,   Head Breadth,   Head Length,   Head Breadth,
First Son,     First Son,      Second Son,    Second Son,
x1             x2              x3             x4
191            155             179            145
195            149             201            152
181            148             185            149
183            153             188            149
176            144             171            142
208            157             192            152
189            150             190            149
197            159             189            152
188            152             197            159
192            150             187            151
179            158             186            148
183            147             174            147
174            150             185            152
190            159             195            157
188            151             187            158
163            137             161            130
195            155             183            158
186            153             173            148
181            145             182            146
175            140             165            137
192            154             185            152
174            143             178            147
176            139             176            143
197            167             200            158
190            163             187            150


*These data, used in examples in the first edition of this book, came from Rao
(1952), p. 245. Izenman (1980) has indicated some entries were apparently
incorrectly copied from Frets (1921) and corrected them (p. 579).


3.5. (Sec. 3.2) Let $x_1$ be the body weight (in kilograms) of a cat and $x_2$ the heart
weight (in grams). [Data from Fisher (1947b).]


(a) In a sample of 47 female cats the relevant data are

$\sum x_\alpha = \begin{pmatrix} 110.9 \\ 432.5 \end{pmatrix}, \qquad \sum x_\alpha x_\alpha' = \begin{pmatrix} 265.13 & 1029.62 \\ 1029.62 & 4064.71 \end{pmatrix}$.
Table 3.4. Four Measurements on Three Species of Iris (in centimeters)


Iris setosa
Sepal    Sepal    Petal    Petal
length   width    length   width
5.1      3.5      1.4      0.2
4.9      3.0      1.4      0.2
4.7      3.2      1.3      0.2
4.6      3.1      1.5      0.2
5.0      3.6      1.4      0.2
5.4      3.9      1.7      0.4
4.6      3.4      1.4      0.3
5.0      3.4      1.5      0.2
4.4      2.9      1.4      0.2
4.9      3.1      1.5      0.1
5.4      3.7      1.5      0.2
4.6      3.4      1.6      0.2
4.8      3.0      1.4      0.1
4.3      3.0      1.1      0.1
5.8      4.0      1.2      0.2
5.7      4.4      1.5      0.4
5.4      3.9      1.3      0.4
5.1      3.5      1.4      0.3
5.7      3.8      1.7      0.3
5.1      3.8      1.5      0.3
5.4      3.4      1.7      0.2
5.1      3.7      1.5      0.4
4.6      3.6      1.0      0.2
5.1      3.3      1.7      0.5
4.8      3.4      1.9      0.2
5.0      3.0      1.6      0.2
5.0      3.4      1.6      0.4
5.2      3.5      1.5      0.2
5.2      3.4      1.4      0.2
4.7      3.2      1.6      0.2
4.8      3.1      1.6      0.2
5.4      3.4      1.5      0.4
5.2      4.1      1.5      0.1
5.5      4.2      1.4      0.2
4.9      3.1      1.5      0.2
5.0      3.2      1.2      0.2
5.5      3.5      1.3      0.2
4.9      3.6      1.4      0.1
4.4      3.0      1.3      0.2
5.1      3.4      1.5      0.2


Iris versicolor                      | Iris virginica
Sepal    Sepal    Petal    Petal    | Sepal    Sepal    Petal    Petal
length   width    length   width    | length   width    length   width
7.0      3.2      4.7      1.4      | 6.3      3.3      6.0      2.5
6.4      3.2      4.5      1.5      | 5.8      2.7      5.1      1.9
6.9      3.1      4.9      1.5      | 7.1      3.0      5.9      2.1
5.5      2.3      4.0      1.3      | 6.3      2.9      5.6      1.8
6.5      2.8      4.6      1.5      | 6.5      3.0      5.8      2.2
5.7      2.8      4.5      1.3      | 7.6      3.0      6.6      2.1
6.3      3.3      4.7      1.6      | 4.9      2.5      4.5      1.7
4.9      2.4      3.3      1.0      | 7.3      2.9      6.3      1.8
6.6      2.9      4.6      1.3      | 6.7      2.5      5.8      1.8
5.2      2.7      3.9      1.4      | 7.2      3.6      6.1      2.5
5.0      2.0      3.5      1.0      | 6.5      3.2      5.1      2.0
5.9      3.0      4.2      1.5      | 6.4      2.7      5.3      1.9
6.0      2.2      4.0      1.0      | 6.8      3.0      5.5      2.1
6.1      2.9      4.7      1.4      | 5.7      2.5      5.0      2.0
5.6      2.9      3.6      1.3      | 5.8      2.8      5.1      2.4
6.7      3.1      4.4      1.4      | 6.4      3.2      5.3      2.3
5.6      3.0      4.5      1.5      | 6.5      3.0      5.5      1.8
5.8      2.7      4.1      1.0      | 7.7      3.8      6.7      2.2
6.2      2.2      4.5      1.5      | 7.7      2.6      6.9      2.3
5.6      2.5      3.9      1.1      | 6.0      2.2      5.0      1.5
5.9      3.2      4.8      1.8      | 6.9      3.2      5.7      2.3
6.1      2.8      4.0      1.3      | 5.6      2.8      4.9      2.0
6.3      2.5      4.9      1.5      | 7.7      2.8      6.7      2.0
6.1      2.8      4.7      1.2      | 6.3      2.7      4.9      1.8
6.4      2.9      4.3      1.3      | 6.7      3.3      5.7      2.1
6.6      3.0      4.4      1.4      | 7.2      3.2      6.0      1.8
6.8      2.8      4.8      1.4      | 6.2      2.8      4.8      1.8
6.7      3.0      5.0      1.7      | 6.1      3.0      4.9      1.8
6.0      2.9      4.5      1.5      | 6.4      2.8      5.6      2.1
5.7      2.6      3.5      1.0      | 7.2      3.0      5.8      1.6
5.5      2.4      3.8      1.1      | 7.4      2.8      6.1      1.9
5.5      2.4      3.7      1.0      | 7.9      3.8      6.4      2.0
5.8      2.7      3.9      1.2      | 6.4      2.8      5.6      2.2
6.0      2.7      5.1      1.6      | 6.3      2.8      5.1      1.5
5.4      3.0      4.5      1.5      | 6.1      2.6      5.6      1.4
6.0      3.4      4.5      1.6      | 7.7      3.0      6.1      2.3
6.7      3.4      4.7      1.5      | 6.3      3.4      5.6      2.4
6.3      2.3      4.4      1.3      | 6.4      3.1      5.5      1.8
5.6      3.0      4.1      1.3      | 6.0      3.0      4.8      1.8
5.5      2.5      4.0      1.3      | 6.9      3.1      5.4      2.4


Table 3.4. (Continued)
Iris setosa, Iris versicolor, Iris virginica: sepal length, sepal width, petal length, petal width

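(b) In a sample of 97 male cats the relevant data are

$\sum x_\alpha = \begin{pmatrix} 281.3 \\ 1098.3 \end{pmatrix}, \qquad \sum x_\alpha x_\alpha' = \begin{pmatrix} 836.75 & 3275.55 \\ 3275.55 & 13056.17 \end{pmatrix}$.

Find $\hat{\mu}$, $\hat{\Sigma}$, S, and $\hat{\rho}$.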


3.6. Find $\hat{\mu}$, $\hat{\Sigma}$, and $(\hat{\rho}_{ij})$ for Iris setosa from Table 3.4, taken from Edgar Anderson's
famous iris data [Fisher (1936)].


3.7. (Sec. 3.2) Invariance of the sample correlation coefficient. Prove that $r_{12}$ is an
invariant characteristic of the sufficient statistics $\bar{x}$ and S of a bivariate sample
under location and scale transformations ($x_{i\alpha}^* = b_i x_{i\alpha} + c_i$, $b_i > 0$, $i = 1, 2$,
$\alpha = 1, \ldots, N$) and that every function of $\bar{x}$ and S that is invariant is a function of
$r_{12}$. [Hint: See Theorem 2.3.2.]


3.8. (Sec. 3.2) Prove Lemma 3.2.2 by induction. [Hint: Let $H_1 = h_{11}$,

$H_i = \begin{pmatrix} H_{i-1} & h_{(i)} \\ h_{(i)}' & h_{ii} \end{pmatrix}, \qquad i = 2, \ldots, p$,

and use Problem 2.36.]


3.9. (Sec. 3.2) Show that

$\frac{1}{N(N - 1)}\sum_{\alpha < \beta}(x_\alpha - x_\beta)(x_\alpha - x_\beta)' = \frac{1}{N - 1}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.

(Note: When p = 1, the left-hand side is the average squared differences of the
observations.)
3.10. (Sec. 3.2) Estimation of $\Sigma$ when $\mu$ is known. Show that if $x_1, \ldots, x_N$ constitute
a sample from $N(\mu, \Sigma)$ and $\mu$ is known, then $(1/N)\sum_{\alpha=1}^{N}(x_\alpha - \mu)(x_\alpha - \mu)'$ is
the maximum likelihood estimator of $\Sigma$.


3.11. (Sec. 3.2) Estimation of parameters of a complex normal distribution. Let
$z_1, \ldots, z_N$ be N observations from the complex normal distribution with mean
$\theta$ and covariance matrix P. (See Problem 2.64.)


(a) Show that the maximum likelihood estimators of $\theta$ and P are

$\hat{\theta} = \bar{z} = \frac{1}{N}\sum_{\alpha=1}^{N} z_\alpha, \qquad \hat{P} = \frac{1}{N}\sum_{\alpha=1}^{N}(z_\alpha - \bar{z})(z_\alpha - \bar{z})^*$.


(b) Show that $\bar{z}$ has the complex normal distribution with mean $\theta$ and covariance
matrix (1/N)P.

(c) Show that $\bar{z}$ and $\hat{P}$ are independently distributed and that $N\hat{P}$ has the
distribution of $\sum_{\alpha=1}^{n} W_\alpha W_\alpha^*$, where $W_1, \ldots, W_n$ are independently distributed,
each according to the complex normal distribution with mean 0 and covariance
matrix P, and n = N - 1.


3.12. (Sec. 3.2) Prove Lemma 3.2.2 by using Lemma 3.2.3 and showing $N\log|C| -
\operatorname{tr} CD$ has a maximum at $C = ND^{-1}$ by setting the derivatives of this function
with respect to the elements of $C = \Sigma^{-1}$ equal to 0. Show that the function of C
tends to $-\infty$ as C tends to a singular matrix or as one or more elements of C
tend to $\infty$ and/or $-\infty$ (nondiagonal elements); for this latter, the equivalent
of (13) can be used.


3.13. (Sec. 3.3) Let $X_\alpha$ be distributed according to $N(\gamma c_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, where
$\sum c_\alpha^2 > 0$. Show that the distribution of $g = (1/\sum c_\alpha^2)\sum c_\alpha X_\alpha$ is $N[\gamma, (1/\sum c_\alpha^2)\Sigma]$.
Show that $E = \sum_\alpha(X_\alpha - gc_\alpha)(X_\alpha - gc_\alpha)'$ is independently distributed as
$\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{N-1}$ are independent, each with distribution $N(0, \Sigma)$.
[Hint: Let $Z_\alpha = \sum_\beta b_{\alpha\beta}X_\beta$, where $b_{N\beta} = c_\beta/\sqrt{\sum c_\alpha^2}$ and B is orthogonal.]


3.14. (Sec. 3.3) Prove that the power of the test in (19) is a function only of p and
$[N_1N_2/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$, given $\alpha$.


3.15. (Sec. 3.3) Efficiency of the mean. Prove that $\bar{x}$ is efficient for estimating $\mu$.


3.16. (Sec. 3.3) Prove that $\bar{x}$ and S have efficiency $[(N - 1)/N]^{p(p+1)/2}$ for estimating
$\mu$ and $\Sigma$.


3.17. (Sec. 3.2) Prove that $\Pr\{|A| = 0\} = 0$ for A defined by (4) when N > p. [Hint:
Argue that if $Z_p^* = (Z_1, \ldots, Z_p)$, then $|Z_p^*| \neq 0$ implies $A = Z_p^*Z_p^{*\prime} +
\sum_{\alpha=p+1}^{N-1} Z_\alpha Z_\alpha'$ is positive definite. Prove $\Pr\{|Z_j^*| = Z_{jj}|Z_{j-1}^*| + \sum_{i=1}^{j-1} Z_{ij}\operatorname{cof}(Z_{ij})
= 0\} = 0$ by induction, $j = 2, \ldots, p$.]
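3.18. (Sec. 3.4) Prove

$I - \Phi(\Phi + \Sigma)^{-1} = \Sigma(\Phi + \Sigma)^{-1}$,

$\Phi - \Phi(\Phi + \Sigma)^{-1}\Phi = (\Phi^{-1} + \Sigma^{-1})^{-1}$.


3.19. (Sec. 3.4) Prove $(1/N)\sum_{\alpha=1}^{N}(x_\alpha - \mu)(x_\alpha - \mu)'$ is an unbiased estimator of $\Sigma$
when $\mu$ is known.


3.20. (Sec. 3.4) Show that

$\Phi\left(\Phi + \frac{1}{N}\Sigma\right)^{-1}\bar{x} + \frac{1}{N}\Sigma\left(\Phi + \frac{1}{N}\Sigma\right)^{-1}\nu = (\Phi^{-1} + N\Sigma^{-1})^{-1}(N\Sigma^{-1}\bar{x} + \Phi^{-1}\nu)$.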


3.21. (Sec. 3.5) Demonstrate Lemma 3.5.1 using integration by parts. 


3.22. (Sec. 3.5) Show that 


оо (col — Y с-нда MN о-в д, 
] | Pore O Fee x ео f, Pol ae dy, 
1 


1 vu f "6s и dy., 
dxdy - | If laze 


f [rove >= Queer 


3.23. Let $Z(k) = (Z_{ij}(k))$, where $i = 1, \ldots, p$, $j = 1, \ldots, q$ and $k = 1, 2, \ldots$, be a
sequence of random matrices. Let one norm of a matrix A be $N_1(A) =
\max_{i,j}\operatorname{mod}(a_{ij})$, and another be $N_2(A) = \sum_{i,j} a_{ij}^2 = \operatorname{tr} AA'$. Some alternative
ways of defining stochastic convergence of Z(k) to B $(p \times q)$ are


(a) $N_1(Z(k) - B)$ converges stochastically to 0,
(b) $N_2(Z(k) - B)$ converges stochastically to 0, and
(c) $Z_{ij}(k) - b_{ij}$ converges stochastically to 0, $i = 1, \ldots, p$, $j = 1, \ldots, q$.


Prove that these three definitions are equivalent. Note that the definition of
X(k) converging stochastically to a is that for every arbitrary positive $\delta$ and $\varepsilon$,
we can find K large enough so that for k > K

$\Pr\{|X(k) - a| < \delta\} > 1 - \varepsilon$.


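3.24. (Sec. 3.2) Covariance matrices with linear structure [Anderson (1969)]. Let

(i)  $\Sigma = \sum_{g=0}^{q}\sigma_g G_g$,

where $G_0, \ldots, G_q$ are given symmetric matrices such that there exists at least
one (q + 1)-tuplet $\sigma_0, \sigma_1, \ldots, \sigma_q$ such that (i) is positive definite. Show that the
likelihood equations based on N observations are

(ii)  $-N\operatorname{tr}\hat{\Sigma}^{-1}G_g + \operatorname{tr}\hat{\Sigma}^{-1}G_g\hat{\Sigma}^{-1}A = 0, \qquad g = 0, 1, \ldots, q$.

Show that an iterative (scoring) method can be based on

(iii)  $\sum_{h=0}^{q}\operatorname{tr}\hat{\Sigma}_{i-1}^{-1}G_g\hat{\Sigma}_{i-1}^{-1}G_h\,\hat{\sigma}_h^{(i)} = \frac{1}{N}\operatorname{tr}\hat{\Sigma}_{i-1}^{-1}G_g\hat{\Sigma}_{i-1}^{-1}A, \qquad g = 0, 1, \ldots, q$,

where $\hat{\Sigma}_{i-1} = \sum_{h=0}^{q}\hat{\sigma}_h^{(i-1)}G_h$.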


CHAPTER 4 


The Distributions and Uses of 
Sample Correlation Coefficients 


4.1. INTRODUCTION 


In Chapter 2, in which the multivariate normal distribution was introduced, it
was shown that a measure of dependence between two normal variates is the
correlation coefficient $\rho_{ij} = \sigma_{ij}/(\sigma_i\sigma_j)$. In a conditional distribution of
$X_1, \ldots, X_q$ given $X_{q+1} = x_{q+1}, \ldots, X_p = x_p$, the partial correlation $\rho_{ij\cdot q+1,\ldots,p}$
measures the dependence between $X_i$ and $X_j$. The third kind of correlation
discussed was the multiple correlation which measures the relationship
between one variate and a set of others. In this chapter we treat the sample
equivalents of these quantities; they are point estimates of the population
quantities. The distributions of the sample correlations are found. Tests of
hypotheses and confidence intervals are developed.

In the cases of joint normal distributions these correlation coefficients are 
the natural measures of dependence. In the population they are the only 
parameters except for location (means) and scale (standard deviations) pa- 
rameters. In the sample the correlation coefficients are derived as the 
reasonable estimates of the population correlations. Since the sample means
and standard deviations are location and scale estimates, the sample correla- 
tions (that is, the standardized sample second moments) give all possible 
information about the population correlations. The sample correlations are 
the functions of the sufficient statistics that are invariant with respect to 
location and scale transformations; the population correlations are the func- 
tions of the parameters that are invariant with respect to these transforma- 
tions. 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
In regression theory or least squares, one variable is considered random or
dependent, and the others fixed or independent. In correlation theory we 
consider several variables as random and treat them symmetrically. If we 
start with a joint normal distribution and hold all variables fixed except one, 
we obtain the least squares model because the expected value of the random 
variable in the conditional distribution is a linear function of the variables 
held fixed. The sample regression coefficients obtained in least squares are 
functions of the sample variances and correlations. 

In testing independence we shall see that we arrive at the same tests in
either case (i.e., in the joint normal distribution or in the conditional
distribution of least squares). The probability theory under the null hypothesis
is the same. The distribution of the test criterion when the null hypothesis
is not true differs in the two cases. If all variables may be considered random,
one uses correlation theory as given here; if only one variable is random,
one uses least squares theory (which is considered in some generality in
Chapter 8).

In Section 4.2 we derive the distribution of the sample correlation coeffi- 
cient, first when the corresponding population correlation coefficient is 0 (the 
two normal variables being independent) and then for any value of the 
population coefficient. The Fisher z-transform yields a useful approximate 
normal distribution. Exact and approximate confidence intervals are devel- 
oped. In Section 4.3 we carry out the same program for partial correlations, 
that is, correlations in conditional normal distributions. In Section 4.4 the 
distributions and other properties of the sample multiple correlation coeffi- 
cient are studied. In Section 4.5 the asymptotic distributions of these cor- 
relations are derived for elliptically contoured distributions. A stochastic 
representation for a class of such distributions is found. 


4.2. CORRELATION COEFFICIENT OF A BIVARIATE SAMPLE


4.2.1. The Distribution When the Population Correlation Coefficient Is Zero; 
Tests of the Hypothesis of Lack of Correlation 


In Section 3.2 it was shown that if one has a sample (of p-component vectors)
$x_1, \ldots, x_N$ from a normal distribution, the maximum likelihood estimator of
the correlation between $X_i$ and $X_j$ (two components of the random vector
X) is

(1)  $r_{ij} = \frac{\sum_{\alpha=1}^{N}(x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_{\alpha=1}^{N}(x_{i\alpha} - \bar{x}_i)^2}\sqrt{\sum_{\alpha=1}^{N}(x_{j\alpha} - \bar{x}_j)^2}}$,

where $x_{i\alpha}$ is the ith component of $x_\alpha$ and

(2)  $\bar{x}_i = \frac{1}{N}\sum_{\alpha=1}^{N} x_{i\alpha}$.

In this section we shall find the distribution of $r_{ij}$ when the population
correlation between $X_i$ and $X_j$ is zero, and we shall see how to use the
sample correlation coefficient to test the hypothesis that the population
coefficient is zero.

For convenience we shall treat $r_{12}$; the same theory holds for each $r_{ij}$.
Since $r_{12}$ depends only on the first two coordinates of each $x_\alpha$, to find the
distribution of $r_{12}$ we need only consider the joint distribution of $(x_{11}, x_{21})$,
$(x_{12}, x_{22}), \ldots, (x_{1N}, x_{2N})$. We can reformulate the problems to be considered
here, therefore, in terms of a bivariate normal distribution. Let $x_1^*, \ldots, x_N^*$ be
observation vectors from

(3)  $N\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_2\sigma_1\rho & \sigma_2^2 \end{pmatrix}\right]$.

We shall consider

(4)  $r = \frac{a_{12}}{\sqrt{a_{11}a_{22}}}$,

where

(5)  $a_{ij} = \sum_{\alpha=1}^{N}(x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j), \qquad i, j = 1, 2$,

and $\bar{x}_i$ is defined by (2), $x_{i\alpha}$ being the ith component of $x_\alpha^*$.
From Section 3.3 we see that $a_{11}$, $a_{12}$, and $a_{22}$ are distributed like

(6)  $a_{ij} = \sum_{\alpha=1}^{n} z_{i\alpha}z_{j\alpha}, \qquad i, j = 1, 2$,

where n = N - 1, $(z_{1\alpha}, z_{2\alpha})$ is distributed according to

(7)  $N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}\right]$,

and the pairs $(z_{11}, z_{21}), \ldots, (z_{1n}, z_{2n})$ are independently distributed.




Figure 4.1 


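Define the n-component vector $v_i = (z_{i1}, \ldots, z_{in})'$, i = 1, 2. These two
vectors can be represented in an n-dimensional space; see Figure 4.1. The
correlation coefficient is the cosine of the angle, say $\theta$, between $v_1$ and $v_2$.
(See Section 3.2.) To find the distribution of $\cos\theta$ we shall first find the
distribution of $\cot\theta$. As shown in Section 3.2, if we let $b = v_2'v_1/v_1'v_1$, then
$v_2 - bv_1$ is orthogonal to $v_1$ and

(8)  $\cot\theta = \frac{b\,\|v_1\|}{\|v_2 - bv_1\|}$.

If $v_1$ is fixed, we can rotate coordinate axes so that the first coordinate axis
lies along $v_1$. Then $bv_1$ has only the first coordinate different from zero, and
$v_2 - bv_1$ has this first coordinate equal to zero. We shall show that $\cot\theta$ is
proportional to a t-variable when $\rho = 0$.

We use the following lemma.


Lemma 4.2.1. If $Y_1, \ldots, Y_n$ are independently distributed, if $Y_\alpha = (Y_\alpha^{(1)\prime}, Y_\alpha^{(2)\prime})'$
has the density $f_\alpha(y_\alpha)$, and if the conditional density of $Y_\alpha^{(2)}$ given $Y_\alpha^{(1)} = y_\alpha^{(1)}$ is
$f_\alpha(y_\alpha^{(2)}|y_\alpha^{(1)})$, $\alpha = 1, \ldots, n$, then in the conditional distribution of $Y_1^{(2)}, \ldots, Y_n^{(2)}$
given $Y_1^{(1)} = y_1^{(1)}, \ldots, Y_n^{(1)} = y_n^{(1)}$, the random vectors $Y_1^{(2)}, \ldots, Y_n^{(2)}$ are independent
and the density of $Y_\alpha^{(2)}$ is $f_\alpha(y_\alpha^{(2)}|y_\alpha^{(1)})$, $\alpha = 1, \ldots, n$.

Proof. The marginal density of $Y_1^{(1)}, \ldots, Y_n^{(1)}$ is $\prod_{\alpha=1}^{n} f_{1\alpha}(y_\alpha^{(1)})$, where $f_{1\alpha}(y_\alpha^{(1)})$
is the marginal density of $Y_\alpha^{(1)}$, and the conditional density of $Y_1^{(2)}, \ldots, Y_n^{(2)}$
given $Y_1^{(1)} = y_1^{(1)}, \ldots, Y_n^{(1)} = y_n^{(1)}$ is

(9)  $\frac{\prod_{\alpha=1}^{n} f_\alpha(y_\alpha)}{\prod_{\alpha=1}^{n} f_{1\alpha}(y_\alpha^{(1)})} = \prod_{\alpha=1}^{n}\frac{f_\alpha(y_\alpha)}{f_{1\alpha}(y_\alpha^{(1)})} = \prod_{\alpha=1}^{n} f_\alpha(y_\alpha^{(2)}|y_\alpha^{(1)})$. ∎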
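Write $V_i = (Z_{i1}, \ldots, Z_{in})'$, i = 1, 2, to denote random vectors. The conditional
distribution of $Z_{2\alpha}$ given $Z_{1\alpha} = z_{1\alpha}$ is $N(\beta z_{1\alpha}, \sigma^2)$, where $\beta = \rho\sigma_2/\sigma_1$
and $\sigma^2 = \sigma_2^2(1 - \rho^2)$. (See Section 2.5.) The density of $V_2$ given $V_1 = v_1$ is
$N(\beta v_1, \sigma^2 I)$ since the $Z_{2\alpha}$ are independent. Let $b = V_2'v_1/v_1'v_1$ $(= a_{21}/a_{11})$,
so that $bv_1'(V_2 - bv_1) = 0$, and let $U = (V_2 - bv_1)'(V_2 - bv_1) = V_2'V_2 - b^2v_1'v_1$
$(= a_{22} - a_{12}^2/a_{11})$. Then $\cot\theta = b\sqrt{a_{11}/U}$. The rotation of coordinate axes
involves choosing an $n \times n$ orthogonal matrix C with first row $(1/c)v_1'$, where
$c^2 = v_1'v_1$.

We now apply Theorem 3.3.1 with $X_\alpha = Z_{2\alpha}$. Let $Y_\alpha = \sum_\beta c_{\alpha\beta}Z_{2\beta}$,
$\alpha = 1, \ldots, n$. Then $Y_1, \ldots, Y_n$ are independently normally distributed with
variance $\sigma^2$ and means

(10)  $\mathcal{E}Y_1 = \sum_{\gamma=1}^{n} c_{1\gamma}\beta z_{1\gamma} = \beta\sum_{\gamma=1}^{n} c_{1\gamma}z_{1\gamma} = \beta c$,

(11)  $\mathcal{E}Y_\alpha = \sum_{\gamma=1}^{n} c_{\alpha\gamma}\beta z_{1\gamma} = \beta c\sum_{\gamma=1}^{n} c_{\alpha\gamma}c_{1\gamma} = 0, \qquad \alpha \neq 1$.

We have $b = \sum_{\alpha=1}^{n} Z_{2\alpha}z_{1\alpha}/\sum_{\alpha=1}^{n} z_{1\alpha}^2 = c\sum_{\alpha=1}^{n} Z_{2\alpha}c_{1\alpha}/c^2 = Y_1/c$ and, from
Lemma 3.3.1,

(12)  $U = \sum_{\alpha=1}^{n} Z_{2\alpha}^2 - b^2\sum_{\alpha=1}^{n} z_{1\alpha}^2 = \sum_{\alpha=1}^{n} Y_\alpha^2 - Y_1^2 = \sum_{\alpha=2}^{n} Y_\alpha^2$,

which is independent of b. Then $U/\sigma^2$ has a $\chi^2$-distribution with n - 1
degrees of freedom.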


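Lemma 4.2.2. If $(Z_{1\alpha}, Z_{2\alpha})$, $\alpha = 1, \ldots, n$, are independent, each pair with
density (7), then the conditional distributions of $b = \sum_{\alpha=1}^{n} Z_{2\alpha}Z_{1\alpha}/\sum_{\alpha=1}^{n} Z_{1\alpha}^2$
and $U/\sigma^2 = \sum_{\alpha=1}^{n}(Z_{2\alpha} - bZ_{1\alpha})^2/\sigma^2$ given $Z_{1\alpha} = z_{1\alpha}$, $\alpha = 1, \ldots, n$, are
$N(\beta, \sigma^2/c^2)$ $(c^2 = \sum_{\alpha=1}^{n} z_{1\alpha}^2)$ and $\chi^2$ with n - 1 degrees of freedom, respectively;
and b and U are independent.


If $\rho = 0$, then $\beta = 0$, and b is distributed conditionally according to
$N(0, \sigma^2/c^2)$, and

(13)  $\frac{cb/\sigma}{\sqrt{\dfrac{U/\sigma^2}{n - 1}}} = \frac{cb}{\sqrt{\dfrac{U}{n - 1}}}$

has a conditional t-distribution with n - 1 degrees of freedom. (See Problem
4.27.) However, this random variable is

(14)  $\sqrt{n - 1}\,\frac{\sqrt{a_{11}}\,a_{12}/a_{11}}{\sqrt{a_{22} - a_{12}^2/a_{11}}} = \sqrt{n - 1}\,\frac{a_{12}/\sqrt{a_{11}a_{22}}}{\sqrt{1 - [a_{12}^2/(a_{11}a_{22})]}} = \sqrt{n - 1}\,\frac{r}{\sqrt{1 - r^2}}$.

Thus $\sqrt{n - 1}\,r/\sqrt{1 - r^2}$ has a conditional t-distribution with n - 1 degrees of
freedom. The density of t is

(15)  $\frac{\Gamma(\frac{1}{2}n)}{\sqrt{n - 1}\,\Gamma[\frac{1}{2}(n - 1)]\sqrt{\pi}}\left(1 + \frac{t^2}{n - 1}\right)^{-\frac{1}{2}n}$,

and the density of $W = r/\sqrt{1 - r^2}$ is

(16)  $\frac{\Gamma(\frac{1}{2}n)}{\Gamma[\frac{1}{2}(n - 1)]\sqrt{\pi}}(1 + w^2)^{-\frac{1}{2}n}$.

Since $w = r(1 - r^2)^{-\frac{1}{2}}$, we have $dw/dr = (1 - r^2)^{-\frac{3}{2}}$. Therefore the density of
r is (replacing n by N - 1)

(17)  $\frac{\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(N - 2)]\sqrt{\pi}}(1 - r^2)^{\frac{1}{2}(N - 4)}$.

It should be noted that (17) is the conditional density of r for $v_1$ fixed.
However, since (17) does not depend on $v_1$, it is also the marginal density
of r.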


Theorem 4.2.1. Let $X_1, \ldots, X_N$ be independent, each with distribution
$N(\mu, \Sigma)$. If $\rho_{ij} = 0$, the density of $r_{ij}$ defined by (1) is (17).


From (17) we see that the density is symmetric about the origin. For
N > 4, it has a mode at r = 0 and its order of contact with the r-axis at $\pm 1$ is
$\frac{1}{2}(N - 5)$ for N odd and $\frac{1}{2}N - 3$ for N even. Since the density is even, the
odd moments are zero; in particular, the mean is zero. The even moments
are found by integration (letting $x = r^2$ and using the definition of the beta
function). That $\mathcal{E}r^{2m} = \Gamma[\frac{1}{2}(N - 1)]\Gamma(m + \frac{1}{2})/\{\sqrt{\pi}\,\Gamma[\frac{1}{2}(N - 1) + m]\}$ and in
particular that the variance is 1/(N - 1) may be verified by the reader.
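
The moment statements above can be checked numerically. The sketch below uses only the Python standard library and a composite Simpson rule; the function names `null_density` and `simpson` and the choice N = 10 are illustrative assumptions.

```python
import math

def null_density(r, N):
    """Density (17) of r when the population correlation is zero."""
    const = math.gamma((N - 1) / 2) / (math.gamma((N - 2) / 2) * math.sqrt(math.pi))
    return const * (1 - r * r) ** ((N - 4) / 2)

def simpson(f, a, b, steps=2000):
    """Composite Simpson rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

N = 10
total_mass = simpson(lambda r: null_density(r, N), -1.0, 1.0)
variance = simpson(lambda r: r * r * null_density(r, N), -1.0, 1.0)
print(total_mass, variance, 1 / (N - 1))
```

The same routine with $r^{2m}$ in place of $r^2$ can be compared with the gamma-function expression for $\mathcal{E}r^{2m}$.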

The most important use of Theorem 4.2.1 is to find significance points for
testing the hypothesis that a pair of variables are not correlated. Consider the
hypothesis

(18)  $H\colon \rho_{ij} = 0$

for some particular pair (i, j). It would seem reasonable to reject this
hypothesis if the corresponding sample correlation coefficient were very
different from zero. Now how do we decide what we mean by "very
different"?

Let us suppose we are interested in testing H against the alternative
hypotheses $\rho_{ij} > 0$. Then we reject H if the sample correlation coefficient $r_{ij}$
is greater than some number $r_0$. The probability of rejecting H when H is
true is

(19)  $\int_{r_0}^{1} k_N(r)\,dr$,

where $k_N(r)$ is (17), the density of a correlation coefficient based on N
observations. We choose $r_0$ so (19) is the desired significance level. If we test
H against alternatives $\rho_{ij} < 0$, we reject H when $r_{ij} < -r_0$.

Now suppose we are interested in alternatives $\rho_{ij} \neq 0$; that is, $\rho_{ij}$ may be
either positive or negative. Then we reject the hypothesis H if $r_{ij} > r_1$ or
$r_{ij} < -r_1$. The probability of rejection when H is true is

(20)  $\int_{r_1}^{1} k_N(r)\,dr + \int_{-1}^{-r_1} k_N(r)\,dr$.

The number $r_1$ is chosen so that (20) is the desired significance level.


The significance points $r_1$ are given in many books, including Table VI of
Fisher and Yates (1942); the index n in Table VI is equal to our N - 2. Since
$\sqrt{N - 2}\,r/\sqrt{1 - r^2}$ has the t-distribution with N - 2 degrees of freedom,
t-tables can also be used. Against alternatives $\rho_{ij} \neq 0$, reject H if

(21)  $\sqrt{N - 2}\,\frac{|r_{ij}|}{\sqrt{1 - r_{ij}^2}} > t_{N-2}(\alpha)$,

where $t_{N-2}(\alpha)$ is the two-tailed significance point of the t-statistic with N - 2
degrees of freedom for significance level $\alpha$. Against alternatives $\rho_{ij} > 0$,
reject H if

(22)  $\sqrt{N - 2}\,\frac{r_{ij}}{\sqrt{1 - r_{ij}^2}} > t_{N-2}(2\alpha)$.

From (13) and (14) we see that $\sqrt{N - 2}\,r/\sqrt{1 - r^2}$ is the proper statistic for
testing the hypothesis that the regression of $V_2$ on $v_1$ is zero. In terms of the
original observation $\{x_{i\alpha}\}$, we have

(23)  $\sqrt{N - 2}\,\frac{r}{\sqrt{1 - r^2}} = \frac{b\sqrt{\sum_{\alpha=1}^{N}(x_{1\alpha} - \bar{x}_1)^2}}{\sqrt{\sum_{\alpha=1}^{N}[x_{2\alpha} - \bar{x}_2 - b(x_{1\alpha} - \bar{x}_1)]^2/(N - 2)}}$,

where $b = \sum_{\alpha=1}^{N}(x_{2\alpha} - \bar{x}_2)(x_{1\alpha} - \bar{x}_1)/\sum_{\alpha=1}^{N}(x_{1\alpha} - \bar{x}_1)^2$ is the least squares
regression coefficient of $x_{2\alpha}$ on $x_{1\alpha}$. It is seen that the test of $\rho_{12} = 0$ is
equivalent to the test that the regression of $x_2$ on $x_1$ is zero (i.e., that
$\rho_{12}\sigma_2/\sigma_1 = 0$).
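
The identity (23) is easy to confirm numerically. The following is a minimal sketch assuming NumPy; the two function names and the simulated data are illustrative only.

```python
import numpy as np

def t_via_correlation(x1, x2):
    """Left-hand side of (23): sqrt(N - 2) r / sqrt(1 - r^2)."""
    d1, d2 = x1 - x1.mean(), x2 - x2.mean()
    r = (d1 @ d2) / np.sqrt((d1 @ d1) * (d2 @ d2))
    return np.sqrt(len(x1) - 2) * r / np.sqrt(1 - r * r)

def t_via_regression(x1, x2):
    """Right-hand side of (23): slope divided by its estimated standard error."""
    d1, d2 = x1 - x1.mean(), x2 - x2.mean()
    b = (d2 @ d1) / (d1 @ d1)                 # least squares slope of x2 on x1
    resid = d2 - b * d1
    return b * np.sqrt(d1 @ d1) / np.sqrt(resid @ resid / (len(x1) - 2))

rng = np.random.default_rng(7)
x1 = rng.standard_normal(12)
x2 = 0.4 * x1 + rng.standard_normal(12)
print(t_via_correlation(x1, x2), t_via_regression(x1, x2))
```

The two values agree up to rounding error, which is the sense in which the correlation test and the regression test coincide.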

To illustrate this procedure we consider the example given in Section 3.2. 
Let us test the null hypothesis that the effects of the two drugs are uncorre- 
lated against the alternative that they are positively correlated. We shall use 
the 5% level of significance. For N = 10, the 5% significance point ($r_0$) is
0.5494. Our observed correlation coefficient of 0.7952 is significant; we reject 
the hypothesis that the effects of the two drugs are independent. 
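
The significance point in this example can be recovered from a t-table through the relation $t = \sqrt{N - 2}\,r/\sqrt{1 - r^2}$. The sketch below uses only the standard library; the value 1.8595 is the usual tabulated upper 5% point of the t-distribution with 8 degrees of freedom, entered here as an assumption rather than computed, and the function names are illustrative.

```python
import math

def t_from_r(r, N):
    """Convert a sample correlation to the t-statistic with N - 2 degrees of freedom."""
    return math.sqrt(N - 2) * r / math.sqrt(1 - r * r)

def r_from_t(t, N):
    """Inverse of t_from_r: the correlation corresponding to a given t-value."""
    return t / math.sqrt(t * t + N - 2)

N = 10
t_crit = 1.8595                      # tabulated one-sided 5% point, 8 degrees of freedom (assumed)
r0 = r_from_t(t_crit, N)             # one-sided 5% significance point for r
t_obs = t_from_r(0.7952, N)          # observed correlation from the example
print(r0, t_obs, t_obs > t_crit)
```

Both routes lead to the same decision: the observed correlation exceeds $r_0$ exactly when the corresponding t-statistic exceeds the tabulated point.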


4.2.2. The Distribution When the Population Correlation Coefficient Is 
Nonzero; Tests of Hypotheses and Confidence Intervals 


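To find the distribution of the sample correlation coefficient when the
population coefficient is different from zero, we shall first derive the joint
density of $a_{11}$, $a_{12}$, and $a_{22}$. In Section 4.2.1 we saw that, conditional on $v_1$
held fixed, the random variables $b = a_{12}/a_{11}$ and $U/\sigma^2 = (a_{22} - a_{12}^2/a_{11})/\sigma^2$
are distributed independently according to $N(\beta, \sigma^2/c^2)$ and the $\chi^2$-distribution
with n - 1 degrees of freedom, respectively. Denoting the density of the
$\chi^2$-distribution by $g_{n-1}(u)$, we write the conditional density of b and U as
$n(b|\beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)/\sigma^2$. The joint density of $V_1$, b, and U is
$n(v_1|0, \sigma_1^2 I)\,n(b|\beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)/\sigma^2$. The marginal density of
$V_1'V_1/\sigma_1^2 = a_{11}/\sigma_1^2$ is $g_n(u)$; that is, the density of $a_{11}$ is

(24)  $\frac{1}{\sigma_1^2}\,g_n\!\left(\frac{a_{11}}{\sigma_1^2}\right) = \int\cdots\int_{v_1'v_1 = a_{11}} n(v_1|0, \sigma_1^2 I)\,dW$,

where dW is the proper volume element.
The integration is over the sphere $v_1'v_1 = a_{11}$; thus, dW is an element of
area on this sphere. (See Problem 7.1 for the use of angular coordinates in
defining dW.) Thus the joint density of b, U, and $a_{11}$ is

(25)  $\int\cdots\int_{v_1'v_1 = a_{11}} \frac{n(b|\beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)}{\sigma^2}\,n(v_1|0, \sigma_1^2 I)\,dW$

      $= \frac{g_n(a_{11}/\sigma_1^2)\,n(b|\beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)}{\sigma_1^2\sigma^2}$

      $= \frac{1}{2^{\frac{1}{2}n}\Gamma(\frac{1}{2}n)\,\sigma_1^2}\left(\frac{a_{11}}{\sigma_1^2}\right)^{\frac{1}{2}n - 1}\exp\left(-\frac{a_{11}}{2\sigma_1^2}\right)\cdot\frac{\sqrt{a_{11}}}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{a_{11}(b - \beta)^2}{2\sigma^2}\right]\cdot\frac{1}{2^{\frac{1}{2}(n-1)}\Gamma[\frac{1}{2}(n - 1)]\,\sigma^2}\left(\frac{u}{\sigma^2}\right)^{\frac{1}{2}(n - 3)}\exp\left(-\frac{u}{2\sigma^2}\right)$.

Now let $b = a_{12}/a_{11}$, $U = a_{22} - a_{12}^2/a_{11}$. The Jacobian is

(26)  $\left|\frac{\partial(b, u)}{\partial(a_{12}, a_{22})}\right| = \begin{vmatrix} \dfrac{1}{a_{11}} & 0 \\ -\dfrac{2a_{12}}{a_{11}} & 1 \end{vmatrix} = \frac{1}{a_{11}}$.

Thus the density of $a_{11}$, $a_{12}$, and $a_{22}$ is

(27)  $\frac{a_{11}^{\frac{1}{2}(n - 3)}\left(a_{22} - \dfrac{a_{12}^2}{a_{11}}\right)^{\frac{1}{2}(n - 3)} e^{-\frac{1}{2}Q}}{2^n\sigma_1^n\sigma_2^n(1 - \rho^2)^{\frac{1}{2}n}\sqrt{\pi}\,\Gamma(\frac{1}{2}n)\Gamma[\frac{1}{2}(n - 1)]}$,

where

(28)  $Q = \frac{a_{11}}{\sigma_1^2} + \frac{a_{11}}{\sigma^2}\left(\frac{a_{12}}{a_{11}} - \beta\right)^2 + \frac{1}{\sigma^2}\left(a_{22} - \frac{a_{12}^2}{a_{11}}\right)$
      $= \frac{a_{11}}{\sigma_1^2(1 - \rho^2)} - \frac{2\rho a_{12}}{\sigma_1\sigma_2(1 - \rho^2)} + \frac{a_{22}}{\sigma_2^2(1 - \rho^2)}$.

The density can be written

(29)  $\frac{|A|^{\frac{1}{2}(n - 3)} e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}A}}{2^n|\Sigma|^{\frac{1}{2}n}\sqrt{\pi}\,\Gamma(\frac{1}{2}n)\Gamma[\frac{1}{2}(n - 1)]}$

for A positive definite, and 0 otherwise. This is a special case of the Wishart
density derived in Chapter 7.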
We want to find the density of 


(30) r= T 559 


ж 
85 — 
Уана» а а at,” 
1122 (4/07) (2/02) 11922 


where аў — а, /01, a3, 7 a5,/07, and af, = a45/(0, 0). The traasformation 
is equivalent to setting о = о, =1. Then the density of a4, a, and 


r—-4y/y4545 (day = аааз ) is 


aptata yere 
Q1) аһ om ео 

2"(1~ p?) Vs TGn)r[3(n - 1)] 
where 


(32) Q= T 


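To find the density of $r$, we must integrate (31) with respect to $a_{11}$ and $a_{22}$
over the range 0 to $\infty$. There are various ways of carrying out the integration,
which result in different expressions for the density. The method we shall
indicate here is straightforward. We expand part of the exponential:

(33)  $$\exp\!\left[\frac{\rho r \sqrt{a_{11}}\sqrt{a_{22}}}{1 - \rho^2}\right] = \sum_{\alpha=0}^{\infty} \frac{(\rho r \sqrt{a_{11}}\sqrt{a_{22}})^{\alpha}}{\alpha!\,(1 - \rho^2)^{\alpha}}.$$

Then the density (31) is

(34)  $$\frac{(1 - r^2)^{\frac12(n-3)}}{2^n (1 - \rho^2)^{\frac12 n} \sqrt{\pi}\, \Gamma(\tfrac12 n)\, \Gamma[\tfrac12(n-1)]} \sum_{\alpha=0}^{\infty} \frac{(\rho r)^{\alpha}}{\alpha!\,(1 - \rho^2)^{\alpha}} \left\{ \exp\!\left[-\frac{a_{11}}{2(1 - \rho^2)}\right] a_{11}^{\frac12(n+\alpha)-1} \right\} \left\{ \exp\!\left[-\frac{a_{22}}{2(1 - \rho^2)}\right] a_{22}^{\frac12(n+\alpha)-1} \right\}.$$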
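Since

(35)  $$\int_0^{\infty} a^{\frac12(n+\alpha)-1} \exp\!\left[-\frac{a}{2(1 - \rho^2)}\right] da = \Gamma[\tfrac12(n + \alpha)]\, [2(1 - \rho^2)]^{\frac12(n+\alpha)},$$

the integral of (34) (term-by-term integration is permissible) is

(36)  $$\begin{aligned}
&\frac{(1 - r^2)^{\frac12(n-3)}}{2^n (1 - \rho^2)^{\frac12 n} \sqrt{\pi}\, \Gamma(\tfrac12 n)\, \Gamma[\tfrac12(n-1)]} \sum_{\alpha=0}^{\infty} \frac{(\rho r)^{\alpha}}{\alpha!\,(1 - \rho^2)^{\alpha}}\, \Gamma^2[\tfrac12(n + \alpha)]\, 2^{n+\alpha} (1 - \rho^2)^{n+\alpha} \\
&\quad = \frac{(1 - \rho^2)^{\frac12 n} (1 - r^2)^{\frac12(n-3)}}{\sqrt{\pi}\, \Gamma(\tfrac12 n)\, \Gamma[\tfrac12(n-1)]} \sum_{\alpha=0}^{\infty} \frac{(2\rho r)^{\alpha}}{\alpha!}\, \Gamma^2[\tfrac12(n + \alpha)].
\end{aligned}$$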
The duplication formula for the gamma function is

(37)  $$\Gamma(2z) = \frac{2^{2z-1}\, \Gamma(z)\, \Gamma(z + \tfrac12)}{\sqrt{\pi}}.$$

It can be used to modify the constant in (36).


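Theorem 4.2.2. The correlation coefficient in a sample of $N$ from a bivariate
normal distribution with correlation $\rho$ is distributed with density

(38)  $$\frac{2^{n-2} (1 - \rho^2)^{\frac12 n} (1 - r^2)^{\frac12(n-3)}}{(n-2)!\, \pi} \sum_{\alpha=0}^{\infty} \frac{(2\rho r)^{\alpha}}{\alpha!}\, \Gamma^2[\tfrac12(n + \alpha)], \qquad -1 \le r \le 1,$$

where $n = N - 1$.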


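The distribution of $r$ was first found by Fisher (1915). He also gave as
another form of the density,

(39)  $$\frac{(1 - \rho^2)^{\frac12 n} (1 - r^2)^{\frac12(n-3)}}{\pi\,(n-2)!} \left(\frac{d}{dx}\right)^{n-1} \left\{ \frac{\cos^{-1}(-x)}{\sqrt{1 - x^2}} \right\} \Bigg|_{x = r\rho}.$$

See Problem 4.24.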
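Hotelling (1953) has made an exhaustive study of the distribution of $r$. He
has recommended the following form:

(40)  $$\frac{n - 1}{\sqrt{2\pi}} \cdot \frac{\Gamma(n)}{\Gamma(n + \tfrac12)}\, (1 - \rho^2)^{\frac12 n} (1 - r^2)^{\frac12(n-3)} (1 - \rho r)^{-n + \frac12}\, F\!\left(\tfrac12, \tfrac12;\, n + \tfrac12;\, \frac{1 + \rho r}{2}\right),$$

where

(41)  $$F(a, b; c; x) = \sum_{j=0}^{\infty} \frac{\Gamma(a + j)}{\Gamma(a)} \cdot \frac{\Gamma(b + j)}{\Gamma(b)} \cdot \frac{\Gamma(c)}{\Gamma(c + j)} \cdot \frac{x^j}{j!}$$

is a hypergeometric function. (See Problem 4.25.) The series in (40) converges
more rapidly than the one in (38). Hotelling discusses methods of integrating
the density and also calculates moments of $r$.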
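As a numerical sanity check, the two forms of the density can be evaluated side by side. The following Python sketch is only illustrative: the function names (`density_series`, `hyp2f1`, `density_hotelling`), the truncation length of 400 terms, and the choice of working on the log scale are assumptions made here, not part of the text. It requires $N \ge 3$, $|r| < 1$, and $|\rho| < 1$.

```python
import math

def density_series(r, rho, N, terms=400):
    """Series form (38) of the density of r, with n = N - 1."""
    n = N - 1
    log_c = ((n - 2) * math.log(2.0) + 0.5 * n * math.log1p(-rho * rho)
             + 0.5 * (n - 3) * math.log1p(-r * r)
             - math.lgamma(n - 1) - math.log(math.pi))
    x = 2.0 * rho * r
    total = 0.0
    for a in range(terms):
        # log of Gamma^2[(n + a)/2] / a!
        log_mag = 2.0 * math.lgamma(0.5 * (n + a)) - math.lgamma(a + 1)
        if a > 0:
            if x == 0.0:
                break  # only the alpha = 0 term survives
            log_mag += a * math.log(abs(x))
        sign = -1.0 if (x < 0.0 and a % 2 == 1) else 1.0
        total += sign * math.exp(log_c + log_mag)
    return total

def hyp2f1(a, b, c, x, terms=400):
    """Hypergeometric series (41), summed term by term (needs |x| < 1)."""
    term, total = 1.0, 1.0
    for j in range(terms):
        term *= (a + j) * (b + j) / ((c + j) * (j + 1)) * x
        total += term
    return total

def density_hotelling(r, rho, N):
    """Hotelling's form (40) of the density of r, with n = N - 1."""
    n = N - 1
    log_c = (math.log(n - 1) - 0.5 * math.log(2.0 * math.pi)
             + math.lgamma(n) - math.lgamma(n + 0.5)
             + 0.5 * n * math.log1p(-rho * rho)
             + 0.5 * (n - 3) * math.log1p(-r * r)
             - (n - 0.5) * math.log(1.0 - rho * r))
    return math.exp(log_c) * hyp2f1(0.5, 0.5, n + 0.5, 0.5 * (1.0 + rho * r))
```

For moderate $N$ the two functions should agree to many digits, and either one should integrate to approximately 1 over $(-1, 1)$.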

The cumulative distribution of $r$,

(42)  $$\Pr\{r \le r^*\} = F(r^* \mid N, \rho),$$

has been tabulated by David (1938) for† $\rho = 0(.1).9$, $N = 3(1)25, 50, 100, 200,
400$, and $r^* = -1(.05)1$. (David's $n$ is our $N$.) It is clear from the density (38)
that $F(r^* \mid N, \rho) = 1 - F(-r^* \mid N, -\rho)$ because the density for $r, \rho$ is equal to
the density for $-r, -\rho$. These tables can be used for a number of statistical
procedures.

First, we consider the problem of using a sample to test the hypothesis 


(43)  $$H: \rho = \rho_0.$$

If the alternatives are $\rho > \rho_0$, we reject the hypothesis if the sample correlation
coefficient is greater than $r_0$, where $r_0$ is chosen so $1 - F(r_0 \mid N, \rho_0) = \alpha$,
the significance level. If the alternatives are $\rho < \rho_0$, we reject the hypothesis
if the sample correlation coefficient is less than $r_0'$, where $r_0'$ is chosen so
$F(r_0' \mid N, \rho_0) = \alpha$. If the alternatives are $\rho \ne \rho_0$, the region of rejection is $r > r_1$
and $r < r_1'$, where $r_1$ and $r_1'$ are chosen so $[1 - F(r_1 \mid N, \rho_0)] + F(r_1' \mid N, \rho_0) = \alpha$.
David suggests that $r_1$ and $r_1'$ be chosen so $[1 - F(r_1 \mid N, \rho_0)] = F(r_1' \mid N, \rho_0)
= \tfrac12\alpha$. She has shown (1937) that for $N \ge 10$, $|\rho| \le 0.8$ this critical region is
nearly the region of an unbiased test of $H$, that is, a test whose power
function has its minimum at $\rho_0$.

It should be pointed out that any test based on $r$ is invariant under
transformations of location and scale, that is, $x^*_{i\alpha} = b_i x_{i\alpha} + c_i$, $b_i > 0$, $i = 1, 2$,
$\alpha = 1, \ldots, N$; and $r$ is essentially the only invariant of the sufficient statistics
(Problem 3.7). The above procedure for testing $H: \rho = \rho_0$ against alternatives
$\rho > \rho_0$ is uniformly most powerful among all invariant tests. (See
Problems 4.16, 4.17, and 4.18.)

† $\rho = 0(.1).9$ means $\rho = 0, 0.1, 0.2, \ldots, 0.9$.


Table 4.1. A Power Function

   ρ      Probability
 -1.0     0.0000
 -0.8     0.0000
 -0.6     0.0004
 -0.4     0.0032
 -0.2     0.0147
  0.0     0.0500
  0.2     0.1376
  0.4     0.3215
  0.6     0.6235
  0.8     0.9279
  1.0     1.0000

As an example suppose one wishes to test the hypothesis that $\rho = 0.5$
against alternatives $\rho \ne 0.5$ at the 5% level of significance using the correlation
observed in a sample of 15. In David's tables we find (by interpolation)
that $F(0.027 \mid 15, 0.5) = 0.025$ and $F(0.805 \mid 15, 0.5) = 0.975$. Hence we reject
the hypothesis if our sample $r$ is less than 0.027 or greater than 0.805.

Secondly, we can use David's tables to compute the power function of
a test of correlation. If the region of rejection of $H$ is $r > r_1$ and $r < r_1'$,
the power of the test is a function of the true correlation $\rho$, namely
$[1 - F(r_1 \mid N, \rho)] + F(r_1' \mid N, \rho)$; this is the probability of rejecting the null
hypothesis when the population correlation is $\rho$.

As an example consider finding the power function of the test for $\rho = 0$
considered in the preceding section. The rejection region (one-sided) is
$r > 0.5494$ at the 5% significance level. The probabilities of rejection are
given in Table 4.1. The graph of the power function is illustrated in Figure
4.2.
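A power function of this kind can also be approximated numerically by integrating the density (38). The sketch below is an illustration under stated assumptions: the sample size $N = 10$ is inferred here from the fact that 0.5494 is the one-sided 5% point of $r$ when $\rho = 0$ and $N = 10$; the function names, the midpoint rule, and the truncation at 200 series terms are choices made for this example only. It is valid for $|\rho| < 1$.

```python
import math

def r_density(r, rho, N, terms=200):
    """Series form (38) of the density of r for |r| < 1, with n = N - 1."""
    n = N - 1
    log_c = ((n - 2) * math.log(2.0) + 0.5 * n * math.log1p(-rho * rho)
             + 0.5 * (n - 3) * math.log1p(-r * r)
             - math.lgamma(n - 1) - math.log(math.pi))
    x = 2.0 * rho * r
    total = 0.0
    for a in range(terms):
        log_mag = 2.0 * math.lgamma(0.5 * (n + a)) - math.lgamma(a + 1)
        if a > 0:
            if x == 0.0:
                break  # only the alpha = 0 term is nonzero
            log_mag += a * math.log(abs(x))
        sign = -1.0 if (x < 0.0 and a % 2 == 1) else 1.0
        total += sign * math.exp(log_c + log_mag)
    return total

def r_cdf(r_star, rho, N, steps=2000):
    """F(r* | N, rho) of (42), by the midpoint rule on (-1, r*)."""
    h = (r_star + 1.0) / steps
    return h * sum(r_density(-1.0 + (k + 0.5) * h, rho, N) for k in range(steps))

def power_one_sided(rho, N=10, r_crit=0.5494):
    """Probability of rejecting rho = 0 in favor of rho > 0 when r > r_crit."""
    return 1.0 - r_cdf(r_crit, rho, N)
```

Under these assumptions `power_one_sided(0.0)` should be close to the significance level 0.05, and the power should increase with $\rho$.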

Thirdly, David's computations lead to confidence regions for $\rho$. For given
$N$, $r_1'$ (defining a significance point) is a function of $\rho$, say $f_1(\rho)$, and $r_1$ is
another function of $\rho$, say $f_2(\rho)$, such that

(44)  $$\Pr\{f_1(\rho) < r < f_2(\rho) \mid \rho\} = 1 - \alpha.$$


Clearly, $f_1(\rho)$ and $f_2(\rho)$ are monotonically increasing functions of $\rho$ if $r_1$
and $r_1'$ are chosen so $1 - F(r_1 \mid N, \rho) = \tfrac12\alpha = F(r_1' \mid N, \rho)$. If $\rho = f_i^{-1}(r)$ is the
inverse of $r = f_i(\rho)$, $i = 1, 2$, then the inequality $f_1(\rho) < r$ is equivalent to†
$\rho < f_1^{-1}(r)$, and $r < f_2(\rho)$ is equivalent to $f_2^{-1}(r) < \rho$. Thus (44) can be written

(45)  $$\Pr\{f_2^{-1}(r) < \rho < f_1^{-1}(r) \mid \rho\} = 1 - \alpha.$$

This equation says that the probability is $1 - \alpha$ that we draw a sample such
that the interval $(f_2^{-1}(r), f_1^{-1}(r))$ covers the parameter $\rho$. Thus this interval is
a confidence interval for $\rho$ with confidence coefficient $1 - \alpha$. For a given $N$
and $\alpha$ the curves $r = f_1(\rho)$ and $r = f_2(\rho)$ appear as in Figure 4.3. In testing
the hypothesis $\rho = \rho_0$, the intersection of the line $\rho = \rho_0$ and the two curves
gives the significance points $r_1$ and $r_1'$. In setting up a confidence region for
$\rho$ on the basis of a sample correlation $r^*$, we find the limits $f_2^{-1}(r^*)$ and
$f_1^{-1}(r^*)$ by the intersection of the line $r = r^*$ with the two curves. David gives
these curves for $\alpha = 0.1$, 0.05, 0.02, and 0.01 for various values of $N$. One-sided
confidence regions can be obtained by using only one inequality above.

† The point $(f_1(\rho), \rho)$ on the first curve is to the left of $(r, \rho)$, and the point $(r, f_1^{-1}(r))$ is above $(r, \rho)$.

Figure 4.2. A power function.

Figure 4.3

The tables of $F(r \mid N, \rho)$ can also be used instead of the curves for finding
the confidence interval. Given the sample value $r^*$, $f_1^{-1}(r^*)$ is the value of $\rho$
such that $\tfrac12\alpha = \Pr\{r \le r^* \mid \rho\} = F(r^* \mid N, \rho)$, and similarly $f_2^{-1}(r^*)$ is the value
of $\rho$ such that $\tfrac12\alpha = \Pr\{r \ge r^* \mid \rho\} = 1 - F(r^* \mid N, \rho)$. The interval between
these two values of $\rho$, $(f_2^{-1}(r^*), f_1^{-1}(r^*))$, is the confidence interval.

As an example, consider the confidence interval with confidence coefficient
0.95 based on the correlation of 0.7952 observed in a sample of 10.
Using Graph II of David, we find the two limits are 0.34 and 0.94. Hence we
state that $0.34 < \rho < 0.94$ with confidence 95%.


Definition 4.2.1. Let $L(x, \theta)$ be the likelihood function of the observation
vector $x$ and the parameter vector $\theta \in \Omega$. Let a null hypothesis be defined by a
proper subset $\omega$ of $\Omega$. The likelihood ratio criterion is

(46)  $$\lambda(x) = \frac{\sup_{\theta \in \omega} L(x, \theta)}{\sup_{\theta \in \Omega} L(x, \theta)}.$$

The likelihood ratio test is the procedure of rejecting the null hypothesis when
$\lambda(x)$ is less than a predetermined constant.


Intuitively, one rejects the null hypothesis if the density of the observa- 
tions under the most favorable choice of parameters in the null hypothesis is 
much less than the density under the most favorable unrestricted choice of 
the parameters. Likelihood ratio tests have some desirable features; see
Lehmann (1959), for example. Wald (1943) has proved some favorable
asymptotic properties. For most hypotheses concerning the multivariate 
normal distribution, likelihood ratio tests are appropriate and often are 
optimal. 

Let us consider the likelihood ratio test of the hypothesis that $\rho = \rho_0$
based on a sample $x_1, \ldots, x_N$ from the bivariate normal distribution. The set
$\Omega$ consists of $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ such that $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < \rho < 1$.
The set $\omega$ is the subset for which $\rho = \rho_0$. The likelihood maximized in $\Omega$ is
(by Lemmas 3.2.2 and 3.2.3)

(47)  $$\max_{\Omega} L = \frac{N^N e^{-N}}{(2\pi)^N (1 - r^2)^{\frac12 N}\, a_{11}^{\frac12 N}\, a_{22}^{\frac12 N}}.$$
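Under the null hypothesis the likelihood function is

(48)  $$L = \frac{1}{(2\pi)^N (\sigma^2)^N (1 - \rho_0^2)^{\frac12 N}} \exp\!\left[-\frac{a_{11}/\tau + \tau a_{22} - 2\rho_0 a_{12}}{2\sigma^2(1 - \rho_0^2)}\right],$$

where $\sigma^2 = \sigma_1\sigma_2$ and $\tau = \sigma_1/\sigma_2$. The maximum of (48) with respect to $\tau$
occurs at $\hat\tau = \sqrt{a_{11}}/\sqrt{a_{22}}$. The concentrated likelihood is

(49)  $$\frac{1}{(2\pi)^N (1 - \rho_0^2)^{\frac12 N} (\sigma^2)^N} \exp\!\left[-\frac{\sqrt{a_{11}a_{22}}\,(1 - \rho_0 r)}{\sigma^2(1 - \rho_0^2)}\right];$$

the maximum of (49) occurs at

(50)  $$\hat\sigma^2 = \frac{\sqrt{a_{11}a_{22}}\,(1 - \rho_0 r)}{N(1 - \rho_0^2)}.$$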


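The likelihood ratio criterion is, therefore,

(51)  $$\frac{\max_{\omega} L}{\max_{\Omega} L} = \frac{(1 - \rho_0^2)^{\frac12 N} (1 - r^2)^{\frac12 N}}{(1 - \rho_0 r)^N} = \left[\frac{(1 - \rho_0^2)(1 - r^2)}{(1 - \rho_0 r)^2}\right]^{\frac12 N}.$$

The likelihood ratio test is $(1 - \rho_0^2)(1 - r^2)(1 - \rho_0 r)^{-2} \le c$, where $c$ is chosen
so the probability of the inequality when samples are drawn from normal
populations with correlation $\rho_0$ is the prescribed significance level. The
critical region can be written equivalently as

(52)  $$(\rho_0^2 c - \rho_0^2 + 1)\, r^2 - 2\rho_0 c\, r + c - 1 + \rho_0^2 \ge 0,$$

or

(53)  $$r > \frac{\rho_0 c + (1 - \rho_0^2)\sqrt{1 - c}}{\rho_0^2 c + 1 - \rho_0^2}, \qquad r < \frac{\rho_0 c - (1 - \rho_0^2)\sqrt{1 - c}}{\rho_0^2 c + 1 - \rho_0^2}.$$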

Thus the likelihood ratio test of $H: \rho = \rho_0$ against alternatives $\rho \ne \rho_0$ has a
rejection region of the form $r > r_1$ and $r < r_1'$; but $r_1$ and $r_1'$ are not chosen so
that the probability of each inequality is $\alpha/2$ when $H$ is true, but are taken
to be of the form given in (53), where $c$ is chosen so that the probability of
the two inequalities is $\alpha$.
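The relation between the criterion (51) and the critical points (53) can be sketched in a few lines of Python. The function names are hypothetical, and the constant `c` is treated simply as a given number in $(0, 1)$; finding the `c` that yields a prescribed significance level would require the distribution of $r$ and is not attempted here.

```python
import math

def lr_statistic(r, rho0, N):
    """Likelihood ratio criterion (51)."""
    base = (1.0 - rho0 ** 2) * (1.0 - r ** 2) / (1.0 - rho0 * r) ** 2
    return base ** (N / 2.0)

def lr_critical_points(rho0, c):
    """Lower and upper points of (53); reject when r lies outside them.

    Here c bounds (1 - rho0^2)(1 - r^2)/(1 - rho0 r)^2 and 0 < c < 1.
    """
    denom = rho0 ** 2 * c + 1.0 - rho0 ** 2
    half_width = (1.0 - rho0 ** 2) * math.sqrt(1.0 - c)
    return (rho0 * c - half_width) / denom, (rho0 * c + half_width) / denom
```

At either critical point the quantity $(1 - \rho_0^2)(1 - r^2)/(1 - \rho_0 r)^2$ equals `c`, and the two points are generally not symmetric about $\rho_0$.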
4.2.3. The Asymptotic Distribution of a Sample Correlation Coefficient
and Fisher's z


In this section we shall show that as the sample size increases, a sample 
correlation coefficient tends to be normally distributed. The distribution of a 
particular function of a sample correlation, Fisher's z [Fisher (1921)], which 
has a variance approximately independent of the population correlation, 
tends to normality faster.

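We are particularly interested in the sample correlation coefficient

(54)  $$r(n) = \frac{A_{ij}(n)}{\sqrt{A_{ii}(n)\, A_{jj}(n)}}$$

for some $i$ and $j$, $i \ne j$. This can also be written

(55)  $$r(n) = \frac{C_{ij}(n)}{\sqrt{C_{ii}(n)\, C_{jj}(n)}},$$

where $C_{gh}(n) = A_{gh}(n)/\sqrt{\sigma_{gg}\sigma_{hh}}$. The set $C_{ii}(n)$, $C_{jj}(n)$, and $C_{ij}(n)$ is distributed
like the distinct elements of the matrix

(56)  $$\sum_{\alpha=1}^{n} \begin{pmatrix} Z_{i\alpha}/\sqrt{\sigma_{ii}} \\ Z_{j\alpha}/\sqrt{\sigma_{jj}} \end{pmatrix} \left( Z_{i\alpha}/\sqrt{\sigma_{ii}},\;\; Z_{j\alpha}/\sqrt{\sigma_{jj}} \right),$$

where

$$\rho = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$$

is the correlation of the two standardized components. Let

(57)  $$U(n) = \frac{1}{n} \begin{pmatrix} C_{ii}(n) \\ C_{jj}(n) \\ C_{ij}(n) \end{pmatrix},$$

(58)  $$b = \begin{pmatrix} 1 \\ 1 \\ \rho \end{pmatrix}.$$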
Then by Theorem 3.4.4 the vector $\sqrt{n}\,[U(n) - b]$ has a limiting normal
distribution with mean 0 and covariance matrix

(59)  $$\begin{pmatrix} 2 & 2\rho^2 & 2\rho \\ 2\rho^2 & 2 & 2\rho \\ 2\rho & 2\rho & 1 + \rho^2 \end{pmatrix}.$$

Now we need the general theorem: 


Theorem 4.2.3. Let $\{U(n)\}$ be a sequence of $m$-component random vectors
and $b$ a fixed vector such that $\sqrt{n}\,[U(n) - b]$ has the limiting distribution $N(0, T)$
as $n \to \infty$. Let $f(u)$ be a vector-valued function of $u$ such that each component
$f_j(u)$ has a nonzero differential at $u = b$, and let $\partial f_j(u)/\partial u_i \,|_{u=b}$ be the $i,j$th
component of $\Phi_b$. Then $\sqrt{n}\,\{f[U(n)] - f(b)\}$ has the limiting distribution
$N(0, \Phi_b' T \Phi_b)$.


Proof. See Serfling (1980), Section 3.3, or Rao (1973), Section 6a.2. A
function $g(u)$ is said to have a differential at $b$ or to be totally differentiable
at $b$ if the partial derivatives $\partial g(u)/\partial u_i$ exist at $u = b$ and for every $\varepsilon > 0$
there exists a neighborhood $N_\varepsilon(b)$ such that

(60)  $$\left| g(u) - g(b) - \sum_{i=1}^{m} \frac{\partial g}{\partial u_i}\bigg|_{u=b} (u_i - b_i) \right| \le \varepsilon\, \|u - b\| \qquad \text{for all } u \in N_\varepsilon(b). \qquad ∎$$

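It is clear that $U(n)$ defined by (57) with $b$ and $T$ defined by (58) and (59),
respectively, satisfies the conditions of the theorem. The function

(61)  $$r = \frac{u_3}{\sqrt{u_1 u_2}} = u_3\, u_1^{-1/2}\, u_2^{-1/2}$$

satisfies the conditions; the elements of $\Phi_b$ are

(62)  $$\begin{aligned}
\frac{\partial r}{\partial u_1}\bigg|_{u=b} &= -\tfrac12\, u_3\, u_1^{-3/2}\, u_2^{-1/2}\Big|_{u=b} = -\tfrac12\rho, \\
\frac{\partial r}{\partial u_2}\bigg|_{u=b} &= -\tfrac12\, u_3\, u_1^{-1/2}\, u_2^{-3/2}\Big|_{u=b} = -\tfrac12\rho, \\
\frac{\partial r}{\partial u_3}\bigg|_{u=b} &= u_1^{-1/2}\, u_2^{-1/2}\Big|_{u=b} = 1,
\end{aligned}$$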
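and $f(b) = \rho$. The variance of the limiting distribution of $\sqrt{n}\,[r(n) - \rho]$ is

(63)  $$\begin{aligned}
\left(-\tfrac12\rho,\; -\tfrac12\rho,\; 1\right) \begin{pmatrix} 2 & 2\rho^2 & 2\rho \\ 2\rho^2 & 2 & 2\rho \\ 2\rho & 2\rho & 1 + \rho^2 \end{pmatrix} \begin{pmatrix} -\tfrac12\rho \\ -\tfrac12\rho \\ 1 \end{pmatrix}
&= \left(\rho - \rho^3,\; \rho - \rho^3,\; 1 - \rho^2\right) \begin{pmatrix} -\tfrac12\rho \\ -\tfrac12\rho \\ 1 \end{pmatrix} \\
&= 1 - 2\rho^2 + \rho^4 \\
&= (1 - \rho^2)^2.
\end{aligned}$$

Thus we obtain the following: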


Theorem 4.2.4. If $r(n)$ is the sample correlation coefficient of a sample of $N$
($= n + 1$) from a normal distribution with correlation $\rho$, then $\sqrt{n}\,[r(n) - \rho]/(1 - \rho^2)$
[or $\sqrt{N}\,[r(n) - \rho]/(1 - \rho^2)$] has the limiting distribution $N(0, 1)$.
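The quadratic form behind this limiting variance is easy to verify numerically. The short Python sketch below (the function name `asymptotic_variance_r` is hypothetical) evaluates $\Phi_b' T \Phi_b$ with $T$ taken from (59) and the derivatives taken from (62), so that it can be compared with $(1 - \rho^2)^2$.

```python
def asymptotic_variance_r(rho):
    """Evaluate Phi_b' T Phi_b for the sample correlation coefficient."""
    # Limiting covariance matrix (59) of sqrt(n) [U(n) - b]
    T = [
        [2.0, 2.0 * rho ** 2, 2.0 * rho],
        [2.0 * rho ** 2, 2.0, 2.0 * rho],
        [2.0 * rho, 2.0 * rho, 1.0 + rho ** 2],
    ]
    # Partial derivatives (62) of r = u3 / sqrt(u1 u2) at u = b
    phi = [-0.5 * rho, -0.5 * rho, 1.0]
    return sum(phi[i] * T[i][j] * phi[j] for i in range(3) for j in range(3))
```

For any $\rho$ in $(-1, 1)$ the returned value should match $(1 - \rho^2)^2$ up to rounding.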


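It is clear from Theorem 4.2.3 that if $f(x)$ is differentiable at $x = \rho$, then
$\sqrt{n}\,[f(r) - f(\rho)]$ is asymptotically normally distributed with mean zero and
variance

$$[f'(\rho)]^2 (1 - \rho^2)^2.$$

A useful function to consider is one whose asymptotic variance is constant
(here unity) independent of the parameter $\rho$. This function satisfies the
equation

(64)  $$f'(\rho) = \frac{1}{1 - \rho^2} = \frac12\left[\frac{1}{1 + \rho} + \frac{1}{1 - \rho}\right].$$

Thus $f(\rho)$ is taken as $\tfrac12[\log(1 + \rho) - \log(1 - \rho)] = \tfrac12 \log[(1 + \rho)/(1 - \rho)]$. The
so-called Fisher's $z$ is

(65)  $$z = \frac12 \log\frac{1 + r}{1 - r} = \tanh^{-1} r,$$

where $r = \tanh z = (e^{z} - e^{-z})/(e^{z} + e^{-z})$. Let

(66)  $$\zeta = \frac12 \log\frac{1 + \rho}{1 - \rho}.$$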
Theorem 4.2.5. Let $z$ be defined by (65), where $r$ is the correlation coefficient
of a sample of $N$ ($= n + 1$) from a bivariate normal distribution with
correlation $\rho$; let $\zeta$ be defined by (66). Then $\sqrt{n}\,(z - \zeta)$ has a limiting normal
distribution with mean 0 and variance 1.


It can be shown that to a closer approximation

(67)  $$\mathscr{E} z \approx \zeta + \frac{\rho}{2n},$$

(68)  $$\mathscr{E}(z - \zeta)^2 \approx \frac{1}{n - 2}.$$

The latter follows from

(69)  $$\mathscr{E}(z - \zeta)^2 = \frac{1}{n} + \frac{8 - \rho^2}{4n^2} + \cdots$$

and holds good for $\rho^2/n^2$ small. Hotelling (1953) gives moments of $z$ to order
$n^{-3}$. An important property of Fisher's $z$ is that the approach to normality is
much more rapid than for $r$. David (1938) makes some comparisons between
the tabulated probabilities and the probabilities computed by assuming $z$ is
normally distributed. She recommends that for $N > 25$ one take $z$ as normally
distributed with mean and variance given by (67) and (68). Konishi
(1978a, 1978b, 1979) has also studied $z$. [Ruben (1966) has suggested an
alternative approach, which is more complicated, but possibly more accurate.]
We shall now indicate how Theorem 4.2.5 can be used.


a. Suppose we wish to test the hypothesis $\rho = \rho_0$ on the basis of a sample
of $N$ against the alternatives $\rho \ne \rho_0$. We compute $r$ and then $z$ by (65). Let

(70)  $$\zeta_0 = \frac12 \log\frac{1 + \rho_0}{1 - \rho_0}.$$

Then a region of rejection at the 5% significance level is

(71)  $$\sqrt{N - 3}\, |z - \zeta_0| > 1.96.$$

A better region is

(72)  $$\sqrt{N - 3}\, \left| z - \zeta_0 - \frac{\tfrac12\rho_0}{N - 1} \right| > 1.96.$$


b. Suppose we have a sample of $N_1$ from one population and a sample of
$N_2$ from a second population. How do we test the hypothesis that the two
correlation coefficients are equal, $\rho_1 = \rho_2$? From Theorem 4.2.5 we know
that if the null hypothesis is true then $z_1 - z_2$ [where $z_1$ and $z_2$ are defined
by (65) for the two sample correlation coefficients] is asymptotically normally
distributed with mean 0 and variance $1/(N_1 - 3) + 1/(N_2 - 3)$. As a critical
region of size 5%, we use

(73)  $$\frac{|z_1 - z_2|}{\sqrt{1/(N_1 - 3) + 1/(N_2 - 3)}} > 1.96.$$


c. Under the conditions of paragraph b, assume that $\rho_1 = \rho_2 = \rho$. How
do we use the results of both samples to give a joint estimate of $\rho$? Since $z_1$
and $z_2$ have variances $1/(N_1 - 3)$ and $1/(N_2 - 3)$, respectively, we can
estimate $\zeta$ by

(74)  $$\frac{(N_1 - 3)\, z_1 + (N_2 - 3)\, z_2}{N_1 + N_2 - 6}$$

and convert this to an estimate of $\rho$ by the inverse of (65).


d. Let $r$ be the sample correlation from $N$ observations. How do we
obtain a confidence interval for $\rho$? We know that approximately

(75)  $$\Pr\{-1.96 \le \sqrt{N - 3}\,(z - \zeta) \le 1.96\} = 0.95.$$

From this we deduce that $[-1.96/\sqrt{N - 3} + z,\; 1.96/\sqrt{N - 3} + z]$ is a confidence
interval for $\zeta$. From this we obtain an interval for $\rho$ using the fact
$\rho = \tanh \zeta = (e^{\zeta} - e^{-\zeta})/(e^{\zeta} + e^{-\zeta})$, which is a monotonic transformation.
Thus a 95% confidence interval is

(76)  $$\tanh\!\left(z - 1.96/\sqrt{N - 3}\right) \le \rho \le \tanh\!\left(z + 1.96/\sqrt{N - 3}\right).$$
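The four uses a through d of Fisher's $z$ translate directly into code. The Python sketch below is illustrative only; the function names and the default critical value 1.96 are choices made here, and all results are large-sample approximations in the sense of Theorem 4.2.5.

```python
import math

def fisher_z(r):
    """Fisher's z of (65): z = (1/2) log[(1 + r)/(1 - r)] = atanh(r)."""
    return math.atanh(r)

def z_statistic(r, N, rho0, corrected=False):
    """Signed statistic of (71), or of (72) when corrected=True.

    The hypothesis rho = rho0 is rejected at the 5% level if |value| > 1.96.
    """
    shift = 0.5 * rho0 / (N - 1) if corrected else 0.0
    return math.sqrt(N - 3) * (fisher_z(r) - fisher_z(rho0) - shift)

def two_sample_statistic(r1, N1, r2, N2):
    """Signed version of the statistic in (73) for testing rho1 = rho2."""
    scale = math.sqrt(1.0 / (N1 - 3) + 1.0 / (N2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / scale

def pooled_rho(r1, N1, r2, N2):
    """Weighted estimate (74) of zeta, mapped back to rho by tanh."""
    zeta = ((N1 - 3) * fisher_z(r1) + (N2 - 3) * fisher_z(r2)) / (N1 + N2 - 6)
    return math.tanh(zeta)

def confidence_interval(r, N, z_crit=1.96):
    """Approximate confidence interval (76) for rho."""
    half = z_crit / math.sqrt(N - 3)
    z = fisher_z(r)
    return math.tanh(z - half), math.tanh(z + half)
```

For example, `confidence_interval(0.7952, 10)` gives limits of roughly 0.33 and 0.95, close to the limits 0.34 and 0.94 read from David's Graph II in the earlier example.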


The bootstrap method has been developed to assess the variability of a 
sample quantity. See Efron (1982). We shall illustrate the method on the 
sample correlation coefficient, but it can be applied to other quantities 
studied in this book. 

Suppose $x_1, \ldots, x_N$ is a sample from some bivariate population not necessarily
normal. The approach of the bootstrap is to consider these $N$ vectors
as a finite population of size $N$; a random vector $X$ has the (discrete)
probability

(77)  $$\Pr\{X = x_\alpha\} = \frac{1}{N}, \qquad \alpha = 1, \ldots, N.$$

A random sample of size $N$ drawn from this finite population has a probability
distribution, and the correlation coefficient calculated from such a sample
has a (discrete) probability distribution, say $p_N(r)$. The bootstrap proposes to
use this distribution in place of the unobtainable distribution of the correlation
coefficient of random samples from the parent population. However, it is
prohibitively expensive to compute; instead $p_N(r)$ is estimated by the empirical
distribution of $r$ calculated from a large number of random samples from
(77). Diaconis and Efron (1983) have given an example of $N = 15$; they find
the empirical distribution closely resembles the actual distribution of $r$
(essentially obtainable in this special case). An advantage of this approach is
that it is not necessary to assume knowledge of the parent population;
a disadvantage is the massive computation.
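The resampling scheme just described can be sketched as follows. This is a minimal illustration, not a reproduction of any published implementation: the function names, the default of 1000 replications, and the decision to skip degenerate resamples (those in which one coordinate is constant, so that $r$ is undefined) are assumptions made for this example.

```python
import math
import random

def correlation(pairs):
    """Sample correlation of a list of (x, y) pairs; None if undefined."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlations(pairs, replications=1000, seed=0):
    """Empirical bootstrap distribution of r under the probabilities (77)."""
    rng = random.Random(seed)
    N = len(pairs)
    values = []
    while len(values) < replications:
        # Draw N vectors with replacement, each with probability 1/N
        resample = [pairs[rng.randrange(N)] for _ in range(N)]
        r = correlation(resample)
        if r is not None:  # skip degenerate resamples
            values.append(r)
    return values
```

The spread of the returned values (for instance their standard deviation or selected percentiles) then serves as an assessment of the variability of the observed $r$.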


4.3. PARTIAL CORRELATION COEFFICIENTS; 
CONDITIONAL DISTRIBUTIONS 


4.3.1. Estimation of Partial Correlation Coefficients 


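Partial correlation coefficients in normal distributions are correlation coefficients
in conditional distributions. It was shown in Section 2.5 that if $X$ is
distributed according to $N(\mu, \Sigma)$, where

(1)  $$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$

then the conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is $N[\mu^{(1)} + B(x^{(2)} -
\mu^{(2)}), \Sigma_{11 \cdot 2}]$, where

(2)  $$B = \Sigma_{12}\Sigma_{22}^{-1},$$

(3)  $$\Sigma_{11 \cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

The partial correlations of $X^{(1)}$ given $x^{(2)}$ are the correlations calculated in
the usual way from $\Sigma_{11 \cdot 2}$. In this section we are interested in statistical
problems concerning these correlation coefficients.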

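First we consider the problem of estimation on the basis of a sample of $N$
from $N(\mu, \Sigma)$. What are the maximum likelihood estimators of the partial
correlations of $X^{(1)}$ (of $q$ components), $\rho_{ij \cdot q+1, \ldots, p}$? We know that the
maximum likelihood estimator of $\Sigma$ is $(1/N)A$, where

(4)  $$A = \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)' = \sum_{\alpha=1}^{N} \begin{pmatrix} x_\alpha^{(1)} - \bar x^{(1)} \\ x_\alpha^{(2)} - \bar x^{(2)} \end{pmatrix} \begin{pmatrix} x_\alpha^{(1)} - \bar x^{(1)} \\ x_\alpha^{(2)} - \bar x^{(2)} \end{pmatrix}' = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

and $\bar x = (1/N)\sum_{\alpha=1}^{N} x_\alpha = (\bar x^{(1)\prime}, \bar x^{(2)\prime})'$. The correspondence between $\Sigma$ and
$\Sigma_{11 \cdot 2}$, $B$, and $\Sigma_{22}$ is one-to-one by virtue of (2) and (3) and

(5)  $$\Sigma_{12} = B\,\Sigma_{22},$$

(6)  $$\Sigma_{11} = \Sigma_{11 \cdot 2} + B\,\Sigma_{22} B'.$$

We can now apply Corollary 3.2.1 to the effect that maximum likelihood
estimators of functions of parameters are those functions of the maximum
likelihood estimators of those parameters.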


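Theorem 4.3.1. Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$, where $\mu$
and $\Sigma$ are partitioned as in (1). Define $A$ by (4) and $(\bar x^{(1)\prime}, \bar x^{(2)\prime}) =
(1/N)\sum_{\alpha=1}^{N} (x_\alpha^{(1)\prime}, x_\alpha^{(2)\prime})$. Then the maximum likelihood estimators of $\mu^{(1)}$, $\mu^{(2)}$,
$B$, $\Sigma_{11 \cdot 2}$, and $\Sigma_{22}$ are $\hat\mu^{(1)} = \bar x^{(1)}$, $\hat\mu^{(2)} = \bar x^{(2)}$,

(7)  $$\hat B = A_{12}A_{22}^{-1}, \qquad \hat\Sigma_{11 \cdot 2} = \frac{1}{N}\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right),$$

and $\hat\Sigma_{22} = (1/N)A_{22}$, respectively.


In turn, Corollary 3.2.1 can be used to obtain the maximum likelihood
estimators of $\mu^{(1)}$, $\mu^{(2)}$, $B$, $\Sigma_{22}$, $\sigma_{ii \cdot q+1, \ldots, p}$, $i = 1, \ldots, q$, and $\rho_{ij \cdot q+1, \ldots, p}$,
$i, j = 1, \ldots, q$. It follows that the maximum likelihood estimators of the partial
correlation coefficients are

(8)  $$\hat\rho_{ij \cdot q+1, \ldots, p} = \frac{\hat\sigma_{ij \cdot q+1, \ldots, p}}{\sqrt{\hat\sigma_{ii \cdot q+1, \ldots, p}\; \hat\sigma_{jj \cdot q+1, \ldots, p}}}, \qquad i, j = 1, \ldots, q,$$

where $\hat\sigma_{ij \cdot q+1, \ldots, p}$ is the $i,j$th element of $\hat\Sigma_{11 \cdot 2}$.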
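Theorem 4.3.2. Let $x_1, \ldots, x_N$ be a sample of $N$ from $N(\mu, \Sigma)$. The
maximum likelihood estimators of $\rho_{ij \cdot q+1, \ldots, p}$, the partial correlations of the first
$q$ components conditional on the last $p - q$ components, are given by

(9)  $$\hat\rho_{ij \cdot q+1, \ldots, p} = \frac{a_{ij \cdot q+1, \ldots, p}}{\sqrt{a_{ii \cdot q+1, \ldots, p}\; a_{jj \cdot q+1, \ldots, p}}}, \qquad i, j = 1, \ldots, q,$$

where

(10)  $$(a_{ij \cdot q+1, \ldots, p}) = A_{11} - A_{12}A_{22}^{-1}A_{21} = A_{11 \cdot 2}.$$


The estimator $\hat\rho_{ij \cdot q+1, \ldots, p}$, denoted by $r_{ij \cdot q+1, \ldots, p}$, is called the sample
partial correlation coefficient between $X_i$ and $X_j$
having taken account of $X_{q+1}, \ldots, X_p$. Note that the calculations can be done
in terms of $(r_{ij})$.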
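The matrix $A_{11 \cdot 2}$ can also be represented as

(11)  $$\begin{aligned}
A_{11 \cdot 2} &= \sum_{\alpha=1}^{N} \left[x_\alpha^{(1)} - \bar x^{(1)} - \hat B(x_\alpha^{(2)} - \bar x^{(2)})\right] \left[x_\alpha^{(1)} - \bar x^{(1)} - \hat B(x_\alpha^{(2)} - \bar x^{(2)})\right]' \\
&= A_{11} - \hat B A_{22} \hat B'.
\end{aligned}$$

The vector $x_\alpha^{(1)} - \bar x^{(1)} - \hat B(x_\alpha^{(2)} - \bar x^{(2)})$ is the residual of $x_\alpha^{(1)}$ from its regression
on $x_\alpha^{(2)}$ and 1. The partial correlations are simple correlations between these
residuals. The definition can be used also when the distributions involved are
not normal.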

Two geometric interpretations of the above theory can be given. In
$p$-dimensional space $x_1, \ldots, x_N$ represent $N$ points. The sample regression
function

(12)  $$x^{(1)} = \bar x^{(1)} + \hat B(x^{(2)} - \bar x^{(2)})$$

is a $(p - q)$-dimensional hyperplane which is the intersection of $q$ $(p - 1)$-dimensional
hyperplanes,

(13)  $$x_i = \bar x_i + \sum_{j=q+1}^{p} \hat\beta_{ij}(x_j - \bar x_j), \qquad i = 1, \ldots, q,$$

where $x_i$, $x_j$ are running variables. Here $\hat\beta_{ij}$ is an element of $\hat B = \hat\Sigma_{12}\hat\Sigma_{22}^{-1} =
A_{12}A_{22}^{-1}$. The $i$th row of $\hat B$ is $(\hat\beta_{i,q+1}, \ldots, \hat\beta_{ip})$. Each right-hand side of (13) is
the least squares regression function of $x_i$ on $x_{q+1}, \ldots, x_p$; that is, if we
project the points $x_1, \ldots, x_N$ on the coordinate hyperplane of $x_i, x_{q+1}, \ldots, x_p$,
then (13) is the regression plane. The point with coordinates

(14)  $$\begin{aligned}
x_i &= \bar x_i + \sum_{j=q+1}^{p} \hat\beta_{ij}(x_{j\alpha} - \bar x_j), & i &= 1, \ldots, q, \\
x_j &= x_{j\alpha}, & j &= q+1, \ldots, p,
\end{aligned}$$

is on the hyperplane (13). The difference in the $i$th coordinate of $x_\alpha$ and the
point (14) is $y_{i\alpha} = x_{i\alpha} - [\bar x_i + \sum_{j=q+1}^{p} \hat\beta_{ij}(x_{j\alpha} - \bar x_j)]$ for $i = 1, \ldots, q$ and 0 for
the other coordinates. Let $y_\alpha = (y_{1\alpha}, \ldots, y_{q\alpha})'$. These points can be represented
as $N$ points in a $q$-dimensional space. Then $A_{11 \cdot 2} = \sum_{\alpha=1}^{N} y_\alpha y_\alpha'$.

Figure 4.4

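We can also interpret the sample as $p$ points in $N$-space (Figure 4.4). Let
$u_j = (x_{j1}, \ldots, x_{jN})'$ be the $j$th point, and let $\varepsilon = (1, \ldots, 1)'$ be another point.
The point with coordinates $\bar x_i, \ldots, \bar x_i$ is $\bar x_i \varepsilon$. The projection of $u_i$ on the
hyperplane spanned by $u_{q+1}, \ldots, u_p, \varepsilon$ is

(15)  $$\hat u_i = \bar x_i \varepsilon + \sum_{j=q+1}^{p} \hat\beta_{ij}(u_j - \bar x_j \varepsilon);$$

this is the point on the hyperplane that is at a minimum distance from $u_i$. Let
$u_i^*$ be the vector from $\hat u_i$ to $u_i$, that is, $u_i - \hat u_i$, or, equivalently, this vector
translated so that one endpoint is at the origin. The set of vectors $u_1^*, \ldots, u_q^*$
are the projections of $u_1, \ldots, u_q$ on the hyperplane orthogonal to
$u_{q+1}, \ldots, u_p, \varepsilon$. Then $u_i^{*\prime} u_i^* = a_{ii \cdot q+1, \ldots, p}$, the length squared of $u_i^*$ (i.e., the
square of the distance of $u_i$ from $\hat u_i$). Then $u_i^{*\prime} u_j^* / \sqrt{u_i^{*\prime} u_i^*\; u_j^{*\prime} u_j^*} = r_{ij \cdot q+1, \ldots, p}$
is the cosine of the angle between $u_i^*$ and $u_j^*$.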

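As an example of the use of partial correlations we consider some data
[Hooker (1907)] on yield of hay ($X_1$) in hundredweights per acre, spring
rainfall ($X_2$) in inches, and accumulated temperature above 42°F in the
spring ($X_3$) for an English area over 20 years. The estimates of $\mu_i$, $\sigma_i$
($= \sqrt{\sigma_{ii}}$), and $\rho_{ij}$ are

(16)  $$\hat\mu = \bar x = \begin{pmatrix} 28.02 \\ 4.91 \\ 594 \end{pmatrix}, \qquad
\begin{pmatrix} \hat\sigma_1 \\ \hat\sigma_2 \\ \hat\sigma_3 \end{pmatrix} = \begin{pmatrix} 4.42 \\ 1.10 \\ 85 \end{pmatrix}, \qquad
\begin{pmatrix} 1 & \hat\rho_{12} & \hat\rho_{13} \\ \hat\rho_{21} & 1 & \hat\rho_{23} \\ \hat\rho_{31} & \hat\rho_{32} & 1 \end{pmatrix} = \begin{pmatrix} 1.00 & 0.80 & -0.40 \\ 0.80 & 1.00 & -0.56 \\ -0.40 & -0.56 & 1.00 \end{pmatrix}.$$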

From the correlations we observe that yield and rainfall are positively 
related, yield and temperature are negatively related, and rainfall and tem- 
perature are negatively related. What interpretation is to be given to the 
apparent negative relation between yield and temperature? Does high tem- 
perature tend to cause low yield, or is high temperature associated with low 
rainfall and hence with low yield? To answer this question we consider the 
correlation between yield and temperature when rainfall is held fixed; that is, 


we use the data given above to estimate the partial correlation between $X_1$
and $X_3$ with $X_2$ held fixed. It is†

(17)  $$r_{13 \cdot 2} = \frac{\hat\sigma_{13 \cdot 2}}{\sqrt{\hat\sigma_{11 \cdot 2}\, \hat\sigma_{33 \cdot 2}}} = 0.097.$$

Thus, if the effect of rainfall is removed, yield and temperature are positively
correlated. The conclusion is that both high rainfall and high temperature
increase hay yield, but in most years high rainfall occurs with low temperature
and vice versa.

† We compute with $\hat\Sigma$ as if it were $\Sigma$.
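The value 0.097 can be reproduced from the correlations in (16). When a single variable is held fixed, applying (10) to the correlation matrix gives $a_{13 \cdot 2} = r_{13} - r_{12}r_{23}$, $a_{11 \cdot 2} = 1 - r_{12}^2$, and $a_{33 \cdot 2} = 1 - r_{23}^2$. The Python sketch below (the function name `partial_corr` is hypothetical) uses this special case only; it does not handle conditioning on several variables.

```python
import math

def partial_corr(r_ij, r_ik, r_jk):
    """Sample partial correlation of variables i and j with k held fixed.

    Special case of (9) and (10) for one conditioning variable, computed
    from the simple correlations r_ij, r_ik, r_jk.
    """
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))

# Hooker's data, correlations from (16)
r12, r13, r23 = 0.80, -0.40, -0.56
r13_2 = partial_corr(r13, r12, r23)  # yield and temperature, rainfall fixed
r12_3 = partial_corr(r12, r13, r23)  # yield and rainfall, temperature fixed
```

With these inputs `r13_2` rounds to 0.097 and `r12_3` rounds to 0.759.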
4.3.2. The Distribution of the Sample Partial Correlation Coefficient


In order to test a hypothesis about a population partial correlation coefficient
we want the distribution of the sample partial correlation coefficient. The
partial correlations are computed from $A_{11 \cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ (as indicated
in Theorem 4.3.1) in the same way that correlations are computed from $A$.
To obtain the distribution of a simple correlation we showed that $A$ was
distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{N-1}$ are distributed independently
according to $N(0, \Sigma)$ and independent of $\bar X$ (Theorem 3.3.2). Here
we want to show that $A_{11 \cdot 2}$ is distributed as $\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where
$U_1, \ldots, U_{N-1-(p-q)}$ are distributed independently according to $N(0, \Sigma_{11 \cdot 2})$
and independently of $\hat B$. The distribution of a partial correlation coefficient
will follow from the characterization of the distribution of $A_{11 \cdot 2}$. We state the
theorem in a general form; it will be used in Chapter 8, where we treat
regression in detail. The following corollary applies it to $A_{11 \cdot 2}$, expressed in
terms of residuals.

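Theorem 4.3.3. Suppose $Y_1, \ldots, Y_m$ are independent with $Y_\alpha$ distributed
according to $N(\Gamma w_\alpha, \Phi)$, where $w_\alpha$ is an $r$-component vector. Let $H = \sum_{\alpha=1}^{m} w_\alpha w_\alpha'$,
assumed nonsingular, $G = \sum_{\alpha=1}^{m} Y_\alpha w_\alpha' H^{-1}$, and

(18)  $$C = \sum_{\alpha=1}^{m} (Y_\alpha - G w_\alpha)(Y_\alpha - G w_\alpha)' = \sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' - GHG'.$$

Then $C$ is distributed as $\sum_{\alpha=1}^{m-r} U_\alpha U_\alpha'$, where $U_1, \ldots, U_{m-r}$ are independently
distributed according to $N(0, \Phi)$ and independently of $G$.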


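Proof. The rows of $Y = (Y_1, \ldots, Y_m)$ are random vectors in an $m$-dimensional
space, and the rows of $W = (w_1, \ldots, w_m)$ are fixed vectors in that space.
The idea of the proof is to rotate coordinate axes so that the last $r$ axes are
in the space spanned by the rows of $W$. Let $E_2 = FW$, where $F$ is a square
matrix such that $FHF' = I$. Then

(19)  $$E_2 E_2' = FWW'F' = F \sum_{\alpha=1}^{m} w_\alpha w_\alpha' F' = FHF' = I.$$

Thus the $m$-component rows of $E_2$ are orthogonal and of unit length. It is
possible to find an $(m - r) \times m$ matrix $E_1$ such that

(20)  $$E = \begin{pmatrix} E_1 \\ E_2 \end{pmatrix}$$

is orthogonal. (See Appendix, Lemma A.4.2.) Now let $U = YE'$ (i.e., $U_\alpha =
\sum_{\beta=1}^{m} e_{\alpha\beta} Y_\beta$). By Theorem 3.3.1 the columns of $U = (U_1, \ldots, U_m)$ are independently
and normally distributed, each with covariance matrix $\Phi$. The means
are given by

(21)  $$\begin{aligned}
\mathscr{E} U = \mathscr{E} Y E' &= \Gamma W E' \\
&= \Gamma F^{-1} E_2 (E_1' \;\; E_2') \\
&= (0 \;\;\; \Gamma F^{-1})
\end{aligned}$$

by orthogonality of $E$. To complete the proof we need to show that $C$
transforms to $\sum_{\alpha=1}^{m-r} U_\alpha U_\alpha'$. We have

(22)  $$\sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' = YY' = UEE'U' = UU' = \sum_{\alpha=1}^{m} U_\alpha U_\alpha'.$$

Note that

(23)  $$\begin{aligned}
G = YW'H^{-1} &= UE\, E_2' (F^{-1})' F'F \\
&= U \begin{pmatrix} E_1 \\ E_2 \end{pmatrix} E_2' F \\
&= U \begin{pmatrix} 0 \\ I \end{pmatrix} F = U^{(2)} F,
\end{aligned}$$

where $U^{(2)} = (U_{m-r+1}, \ldots, U_m)$. Then

(24)  $$GHG' = U^{(2)} FHF' U^{(2)\prime} = U^{(2)} U^{(2)\prime} = \sum_{\alpha=m-r+1}^{m} U_\alpha U_\alpha'.$$

Thus $C$ is

(25)  $$\sum_{\alpha=1}^{m} Y_\alpha Y_\alpha' - GHG' = \sum_{\alpha=1}^{m} U_\alpha U_\alpha' - \sum_{\alpha=m-r+1}^{m} U_\alpha U_\alpha' = \sum_{\alpha=1}^{m-r} U_\alpha U_\alpha'.$$

This proves the theorem. ∎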

It follows from the above considerations that when $\Gamma = 0$, then $\mathscr{E} U = 0$, and
we obtain the following:


Corollary 4.3.1. If $\Gamma = 0$, the matrix $GHG'$ defined in Theorem 4.3.3 is
distributed as $\sum_{\alpha=m-r+1}^{m} U_\alpha U_\alpha'$, where $U_{m-r+1}, \ldots, U_m$ are independently distributed,
each according to $N(0, \Phi)$.
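We now find the distribution of $A_{11 \cdot 2}$ in the same form. It was shown in
Theorem 3.3.1 that $A$ is distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{N-1}$ are
independent, each with distribution $N(0, \Sigma)$. Let $Z_\alpha$ be partitioned into two
subvectors of $q$ and $p - q$ components, respectively:

(26)  $$Z_\alpha = \begin{pmatrix} Z_\alpha^{(1)} \\ Z_\alpha^{(2)} \end{pmatrix}.$$

Then $A_{ij} = \sum_{\alpha=1}^{N-1} Z_\alpha^{(i)} Z_\alpha^{(j)\prime}$. By Lemma 4.2.1, conditionally on $Z_1^{(2)} =
z_1^{(2)}, \ldots, Z_{N-1}^{(2)} = z_{N-1}^{(2)}$, the random vectors $Z_1^{(1)}, \ldots, Z_{N-1}^{(1)}$ are independently
distributed, with $Z_\alpha^{(1)}$ distributed according to $N(B z_\alpha^{(2)}, \Sigma_{11 \cdot 2})$, where $B =
\Sigma_{12}\Sigma_{22}^{-1}$ and $\Sigma_{11 \cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Now we apply Theorem 4.3.3 with
$Z_\alpha^{(1)} = Y_\alpha$, $z_\alpha^{(2)} = w_\alpha$, $N - 1 = m$, $p - q = r$, $B = \Gamma$, $\Sigma_{11 \cdot 2} = \Phi$, $A_{11} = \sum Y_\alpha Y_\alpha'$,
$A_{12}A_{22}^{-1} = G$, $A_{22} = H$. We find that the conditional distribution of $A_{11} -
(A_{12}A_{22}^{-1})A_{22}(A_{12}A_{22}^{-1})' = A_{11 \cdot 2}$ given $Z_\alpha^{(2)} = z_\alpha^{(2)}$, $\alpha = 1, \ldots, N - 1$, is that of
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where $U_1, \ldots, U_{N-1-(p-q)}$ are independent, each with distribution
$N(0, \Sigma_{11 \cdot 2})$. Since this distribution does not depend on $\{z_\alpha^{(2)}\}$, we
obtain the following theorem: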


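Theorem 4.3.4. The matrix $A_{11 \cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ is distributed as
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where $U_1, \ldots, U_{N-1-(p-q)}$ are independently distributed, each
according to $N(0, \Sigma_{11 \cdot 2})$, and independently of $A_{12}$ and $A_{22}$.


Corollary 4.3.2. If $\Sigma_{12} = 0$ ($B = 0$), then $A_{11 \cdot 2}$ is distributed as
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$ and $A_{12}A_{22}^{-1}A_{21}$ is distributed as $\sum_{\alpha=N-(p-q)}^{N-1} U_\alpha U_\alpha'$, where
$U_1, \ldots, U_{N-1}$ are independently distributed, each according to $N(0, \Sigma_{11 \cdot 2})$.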


Now it follows that the distribution of $r_{ij \cdot q+1, \ldots, p}$ based on $N$ observations
is the same as that of a simple correlation coefficient based on $N - (p - q)$
observations with a corresponding population correlation value of $\rho_{ij \cdot q+1, \ldots, p}$.

Theorem 4.3.5. If the cdf of $r_{ij}$ based on a sample of $N$ from a normal
distribution with correlation $\rho_{ij}$ is denoted by $F(r \mid N, \rho_{ij})$, then the cdf of
the sample partial correlation $r_{ij \cdot q+1, \ldots, p}$ based on a sample of $N$ from a
normal distribution with partial correlation coefficient $\rho_{ij \cdot q+1, \ldots, p}$ is $F[r \mid N -
(p - q), \rho_{ij \cdot q+1, \ldots, p}]$.


This distribution was derived by Fisher (1924).


4.3.3. Tests of Hypotheses and Confidence Regions for Partial 
Correlation Coefficients 


Since the distribution of a sample partial correlation $r_{ij \cdot q+1, \ldots, p}$ based on a
sample of $N$ from a distribution with population correlation $\rho_{ij \cdot q+1, \ldots, p}$
equal to a certain value, $\rho$ say, is the same as the distribution of a simple
correlation $r$ based on a sample of size $N - (p - q)$ from a distribution with
the corresponding population correlation of $\rho$, all statistical inference procedures
for the simple population correlation can be used for the partial
correlation. The procedure for the partial correlation is exactly the same
except that $N$ is replaced by $N - (p - q)$. To illustrate this rule we give two
examples.


Example 1. Suppose that on the basis of a sample of size $N$ we wish to
obtain a confidence interval for $\rho_{ij \cdot q+1, \ldots, p}$. The sample partial correlation is
$r_{ij \cdot q+1, \ldots, p}$. The procedure is to use David's charts for $N - (p - q)$. In the
example at the end of Section 4.3.1, we might want to find a confidence
interval for $\rho_{12 \cdot 3}$ with confidence coefficient 0.95. The sample partial correlation
is $r_{12 \cdot 3} = 0.759$. We use the chart (or table) for $N - (p - q) = 20 - 1 = 19$.
The interval is $0.50 < \rho_{12 \cdot 3} < 0.88$.


Example 2. Suppose that on the basis of a sample of size $N$ we use
Fisher's $z$ for an approximate significance test of $\rho_{ij \cdot q+1, \ldots, p} = \rho_0$ against
two-sided alternatives. We let

(27)  $$z = \frac12 \log\frac{1 + r_{ij \cdot q+1, \ldots, p}}{1 - r_{ij \cdot q+1, \ldots, p}}, \qquad \zeta_0 = \frac12 \log\frac{1 + \rho_0}{1 - \rho_0}.$$

Then $\sqrt{N - (p - q) - 3}\,(z - \zeta_0)$ is compared with the significance points of
the standardized normal distribution. In the example at the end of Section
4.3.1, we might wish to test the hypothesis $\rho_{13 \cdot 2} = 0$ at the 0.05 level. Then
$\zeta_0 = 0$ and $\sqrt{20 - 1 - 3}\,(0.0973) = 0.3892$. This value is clearly nonsignificant
($|0.3892| < 1.96$), and hence the data do not indicate rejection of the null
hypothesis.
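The arithmetic of Example 2 can be checked with a few lines of Python. The function name `partial_z_statistic` and its argument names are hypothetical; `n_given` stands for $p - q$, the number of variables held fixed.

```python
import math

def partial_z_statistic(r_partial, N, n_given, rho0=0.0):
    """Statistic sqrt(N - (p - q) - 3) (z - zeta_0) built from (27)."""
    z = math.atanh(r_partial)
    zeta0 = math.atanh(rho0)
    return math.sqrt(N - n_given - 3) * (z - zeta0)

# Hooker's data: r_{13.2} = 0.097 from N = 20 observations, one variable held fixed
stat = partial_z_statistic(0.097, N=20, n_given=1)
reject_at_5_percent = abs(stat) > 1.96
```

Here `stat` is approximately 0.389, so the hypothesis is not rejected at the 0.05 level.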

To answer the question whether two variables $x_1$ and $x_2$ are related when
both may be related to a vector $x^{(2)} = (x_3, \ldots, x_p)$, two approaches may be
used. One is to consider the regression of $x_1$ on $x_2$ and $x^{(2)}$ and test whether
the regression of $x_1$ on $x_2$ is 0. Another is to test whether $\rho_{12 \cdot 3, \ldots, p} = 0$.
Problems 4.43-4.47 show that these approaches lead to exactly the same test.


4.4. THE MULTIPLE CORRELATION COEFFICIENT 


4.4.1. Estimation of the Multiple Correlation Coefficient 


The population multiple correlation between one variate and a set of variates
was defined in Section 2.5. For the sake of convenience in this section we
shall treat the case of the multiple correlation between $X_1$ and the vector
$X^{(2)} = (X_2, \ldots, X_p)'$; we shall not need subscripts on $\bar R$. The variables can
always be numbered so that the desired multiple correlation is this one (any
irrelevant variables being omitted). Then the multiple correlation in the
population is

(1)  $$\bar R = \frac{\beta'\sigma_{(1)}}{\sqrt{\sigma_{11}\, \beta'\Sigma_{22}\beta}} = \sqrt{\frac{\beta'\Sigma_{22}\beta}{\sigma_{11}}} = \sqrt{\frac{\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}}{\sigma_{11}}},$$

where $\beta$, $\sigma_{(1)}$, and $\Sigma_{22}$ are defined by

(2)  $$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' \\ \sigma_{(1)} & \Sigma_{22} \end{pmatrix},$$

(3)  $$\beta = \Sigma_{22}^{-1}\sigma_{(1)}.$$


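Given a sample $x_1,\dots,x_N$ ($N > p$), we estimate $\Sigma$ by $S = [N/(N-1)]\hat\Sigma$, or

$$ \hat\Sigma = \frac1N A = \frac1N \sum_{\alpha=1}^N (x_\alpha - \bar x)(x_\alpha - \bar x)' = \begin{pmatrix} \hat\sigma_{11} & \hat\sigma_{(1)}' \\ \hat\sigma_{(1)} & \hat\Sigma_{22} \end{pmatrix}, \tag{4} $$

and we estimate $\beta$ by $\hat\beta = \hat\Sigma_{22}^{-1}\hat\sigma_{(1)} = A_{22}^{-1}a_{(1)}$. We define the sample multiple correlation coefficient by

$$ R = \sqrt{\frac{\hat\beta'\hat\Sigma_{22}\hat\beta}{\hat\sigma_{11}}} = \sqrt{\frac{\hat\sigma_{(1)}'\hat\Sigma_{22}^{-1}\hat\sigma_{(1)}}{\hat\sigma_{11}}} = \sqrt{\frac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}}}. \tag{5} $$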


That this is the maximum likelihood estimator of $\bar R$ is justified by Corollary 3.2.1, since we can define $\bar R$, $\sigma_{(1)}$, $\Sigma_{22}$ as a one-to-one transformation of $\Sigma$. Another expression for $R$ [see (16) of Section 2.5] follows from

$$ 1 - R^2 = \frac{|\hat\Sigma|}{\hat\sigma_{11}\,|\hat\Sigma_{22}|} = \frac{|A|}{a_{11}\,|A_{22}|}. \tag{6} $$


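The quantities $R$ and $\hat\beta$ have properties in the sample that are similar to those $\bar R$ and $\beta$ have in the population. We have analogs of Theorems 2.5.2, 2.5.3, and 2.5.4. Let $\hat x_{1\alpha} = \bar x_1 + \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$, and let $x^*_{1\alpha} = x_{1\alpha} - \hat x_{1\alpha}$ be the residual.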


Theorem 4.4.1. The residuals $x^*_{1\alpha}$ are uncorrelated in the sample with the components of $x_\alpha^{(2)}$, $\alpha = 1,\dots,N$. For every vector $a$

$$ \sum_{\alpha=1}^N \left[x_{1\alpha} - \bar x_1 - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})\right]^2 \le \sum_{\alpha=1}^N \left[x_{1\alpha} - \bar x_1 - a'(x_\alpha^{(2)} - \bar x^{(2)})\right]^2. \tag{7} $$

The sample correlation between $x_{1\alpha}$ and $a'x_\alpha^{(2)}$, $\alpha = 1,\dots,N$, is maximized for $a = \hat\beta$, and that maximum correlation is $R$.


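Proof. Since the sample mean of the residuals is 0, the vector of sample covariances between $x^*_{1\alpha}$ and $x_\alpha^{(2)}$ is proportional to

$$ \sum_{\alpha=1}^N \left[(x_{1\alpha} - \bar x_1) - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})\right](x_\alpha^{(2)} - \bar x^{(2)})' = a_{(1)}' - \hat\beta'A_{22} = 0. \tag{8} $$

The right-hand side of (7) can be written as the left-hand side plus

$$ \sum_{\alpha=1}^N \left[(\hat\beta - a)'(x_\alpha^{(2)} - \bar x^{(2)})\right]^2 = (\hat\beta - a)' \sum_{\alpha=1}^N (x_\alpha^{(2)} - \bar x^{(2)})(x_\alpha^{(2)} - \bar x^{(2)})'\,(\hat\beta - a), \tag{9} $$

which is 0 if and only if $a = \hat\beta$. To prove the third assertion we consider the vector $a$ for which $\sum_{\alpha=1}^N [a'(x_\alpha^{(2)} - \bar x^{(2)})]^2 = \sum_{\alpha=1}^N [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2$, since the correlation is unchanged when the linear function is multiplied by a positive constant. From (7) we obtain

$$ a_{11} - 2\sum_{\alpha=1}^N (x_{1\alpha} - \bar x_1)\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)}) + \sum_{\alpha=1}^N [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2 \le a_{11} - 2\sum_{\alpha=1}^N (x_{1\alpha} - \bar x_1)a'(x_\alpha^{(2)} - \bar x^{(2)}) + \sum_{\alpha=1}^N [a'(x_\alpha^{(2)} - \bar x^{(2)})]^2, \tag{10} $$

from which we deduce

$$ \frac{\sum_{\alpha=1}^N (x_{1\alpha} - \bar x_1)(x_\alpha^{(2)} - \bar x^{(2)})'a}{\sqrt{a_{11}}\sqrt{\sum_{\alpha=1}^N [a'(x_\alpha^{(2)} - \bar x^{(2)})]^2}} \le \frac{\sum_{\alpha=1}^N (x_{1\alpha} - \bar x_1)(x_\alpha^{(2)} - \bar x^{(2)})'\hat\beta}{\sqrt{a_{11}}\sqrt{\sum_{\alpha=1}^N [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2}} = \frac{a_{(1)}'\hat\beta}{\sqrt{a_{11}\,\hat\beta'A_{22}\hat\beta}}, \tag{11} $$

which is (5). ∎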


Thus $\bar x_1 + \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is the best linear predictor of $x_{1\alpha}$ in the sample, and $\hat\beta'x_\alpha^{(2)}$ is the linear function of $x_\alpha^{(2)}$ that has maximum sample correlation with $x_{1\alpha}$. The minimum sum of squares of deviations [the left-hand side of (7)] is

$$ \sum_{\alpha=1}^N \left[(x_{1\alpha} - \bar x_1) - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})\right]^2 = a_{11} - \hat\beta'A_{22}\hat\beta = a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)} = a_{11\cdot 2}, \tag{12} $$

as defined in Section 4.3 with $q = 1$. The maximum likelihood estimator of $\sigma_{11\cdot 2}$ is $\hat\sigma_{11\cdot 2} = a_{11\cdot 2}/N$. It follows that

$$ \hat\sigma_{11\cdot 2} = (1 - R^2)\,\hat\sigma_{11}. \tag{13} $$

Thus $1 - R^2$ measures the proportional reduction in the variance by using residuals. We can say that $R^2$ is the fraction of the variance explained by $x^{(2)}$. The larger $R^2$ is, the more the variance is decreased by use of the explanatory variables in $x^{(2)}$.

In $p$-dimensional space $x_1,\dots,x_N$ represent $N$ points. The sample regression function $x_1 = \bar x_1 + \hat\beta'(x^{(2)} - \bar x^{(2)})$ is the $(p-1)$-dimensional hyperplane that minimizes the squared deviations of the points from the hyperplane, the deviations being calculated in the $x_1$-direction. The hyperplane goes through the point $\bar x$.

In $N$-dimensional space the rows of $(x_1,\dots,x_N)$ represent $p$ points. The $N$-component vector with $\alpha$th component $x_{i\alpha} - \bar x_i$ is the projection of the vector with $\alpha$th component $x_{i\alpha}$ on the plane orthogonal to the equiangular line. We have $p$ such vectors; $a'(x_\alpha^{(2)} - \bar x^{(2)})$ is the $\alpha$th component of a vector in the hyperplane spanned by the last $p - 1$ vectors. Since the right-hand side of (7) is the squared distance between the first vector and the linear combination of the last $p - 1$ vectors, $\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is a component of the vector which minimizes this squared distance. The interpretation of (8) is that the vector with $\alpha$th component $(x_{1\alpha} - \bar x_1) - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is orthogonal to each of the last $p - 1$ vectors. Thus the vector with $\alpha$th component $\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is the projection of the first vector on the hyperplane. See Figure 4.5. The length squared of the projection vector is

$$ \sum_{\alpha=1}^N \left[\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})\right]^2 = \hat\beta'A_{22}\hat\beta = a_{(1)}'A_{22}^{-1}a_{(1)}, \tag{14} $$

and the length squared of the first vector is $\sum_{\alpha=1}^N (x_{1\alpha} - \bar x_1)^2 = a_{11}$. Thus $R$ is the cosine of the angle between the first vector and its projection.
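The sample identities of this subsection (orthogonality of the residuals in (8), the residual sum of squares in (12), the variance reduction in (13), and $R$ as a cosine) can be checked numerically. The following is only a sketch: the six observations are made up for illustration, $p = 3$, and $A_{22}^{-1}$ is written out as an explicit $2 \times 2$ inverse.

```python
import math

# Hypothetical data: N = 6 observations on (x1, x2, x3), so p = 3.
x1 = [2.0, 4.0, 5.0, 4.0, 5.0, 7.0]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x3 = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]

def centered(v):
    m = sum(v) / len(v)
    return [t - m for t in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

d1, d2, d3 = centered(x1), centered(x2), centered(x3)

# Blocks of A = sum (x_a - xbar)(x_a - xbar)'.
a11 = dot(d1, d1)
a_1 = [dot(d1, d2), dot(d1, d3)]                                  # a_(1)
a22 = [[dot(d2, d2), dot(d2, d3)], [dot(d3, d2), dot(d3, d3)]]    # A_22

# beta_hat = A22^{-1} a_(1) via the explicit 2x2 inverse.
det = a22[0][0] * a22[1][1] - a22[0][1] * a22[1][0]
beta = [(a22[1][1] * a_1[0] - a22[0][1] * a_1[1]) / det,
        (a22[0][0] * a_1[1] - a22[1][0] * a_1[0]) / det]

r2 = (a_1[0] * beta[0] + a_1[1] * beta[1]) / a11                  # square of (5)

# Projection beta_hat'(x^(2) - xbar^(2)) and residuals x*_{1a}.
proj = [beta[0] * v + beta[1] * w for v, w in zip(d2, d3)]
resid = [u - q for u, q in zip(d1, proj)]
rss = dot(resid, resid)                                           # left side of (12)
cosine = dot(d1, proj) / math.sqrt(a11 * dot(proj, proj))         # angle interpretation

print("R^2 =", r2, " R =", math.sqrt(r2), " cosine =", cosine)
print("residual cross-products (should be ~0):", dot(resid, d2), dot(resid, d3))
print("a11 (1 - R^2) =", a11 * (1.0 - r2), " residual SS =", rss)
```

Because $R$ is unchanged by rescaling, the same arithmetic applies to a correlation matrix in place of $A$.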
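[Figure 4.5: the vector $(x_{11} - \bar x_1,\dots,x_{1N} - \bar x_1)$, its projection with components $\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$, and the vectors $(x_{21} - \bar x_2,\dots,x_{2N} - \bar x_2)$ and $(x_{31} - \bar x_3,\dots,x_{3N} - \bar x_3)$ spanning the hyperplane.]

Figure 4.5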


In Section 3.2 we saw that the simple correlation coefficient is the cosine of the angle between the two vectors involved (in the plane orthogonal to the equiangular line). The property of $R$ that it is the maximum correlation between $x_{1\alpha}$ and linear combinations of the components of $x_\alpha^{(2)}$ corresponds to the geometric property that $R$ is the cosine of the smallest angle between the vector with components $x_{1\alpha} - \bar x_1$ and a vector in the hyperplane spanned by the other $p - 1$ vectors.

The geometric interpretations are in terms of the vectors in the $(N-1)$-dimensional hyperplane orthogonal to the equiangular line. It was shown in Section 3.3 that the vector $(x_{i1} - \bar x_i,\dots,x_{iN} - \bar x_i)$ in this hyperplane can be designated as $(z_{i1},\dots,z_{i,N-1})$, where the $z_{i\alpha}$ are the coordinates referred to an $(N-1)$-dimensional coordinate system in the hyperplane. It was shown that the new coordinates are obtained from the old by the transformation $z_{i\alpha} = \sum_{\beta=1}^N b_{\alpha\beta}x_{i\beta}$, $\alpha = 1,\dots,N$, where $B = (b_{\alpha\beta})$ is an orthogonal matrix with last row $(1/\sqrt N,\dots,1/\sqrt N)$. Then

$$ a_{ij} = \sum_{\alpha=1}^N (x_{i\alpha} - \bar x_i)(x_{j\alpha} - \bar x_j) = \sum_{\alpha=1}^{N-1} z_{i\alpha}z_{j\alpha}. \tag{15} $$

It will be convenient to refer to the multiple correlation defined in terms of $z_{i\alpha}$ as the multiple correlation without subtracting the means.

The population multiple correlation $\bar R$ is essentially the only function of the parameters $\mu$ and $\Sigma$ that is invariant under changes of location, changes of scale of $X_1$, and nonsingular linear transformations of $X^{(2)}$, that is, transformations $X_1^* = cX_1 + d$, $X^{(2)*} = CX^{(2)} + d$. Similarly, the sample multiple correlation coefficient $R$ is essentially the only function of $\bar x$ and $\hat\Sigma$, the sufficient set of statistics for $\mu$ and $\Sigma$, that is invariant under these transformations. Just as the simple correlation $r$ is a measure of association between two scalar variables in a sample, the multiple correlation $R$ is a measure of association between a scalar variable and a vector variable in a sample.


4.4.2. Distribution of the Sample Multiple Correlation Coefficient 
When the Population Multiple Correlation Coefficient Is Zero 


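From (5) we have

$$ R^2 = \frac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}}; \tag{16} $$

then

$$ 1 - R^2 = 1 - \frac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}} = \frac{a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}} = \frac{a_{11\cdot 2}}{a_{11}}, \tag{17} $$

and

$$ \frac{R^2}{1 - R^2} = \frac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11\cdot 2}}. \tag{18} $$

For $q = 1$, Corollary 4.3.2 states that when $\beta = 0$, that is, when $\bar R = 0$, $a_{11\cdot 2}$ is distributed as $\sum_{\alpha=1}^{N-p} V_\alpha^2$ and $a_{(1)}'A_{22}^{-1}a_{(1)}$ is distributed as $\sum_{\alpha=N-p+1}^{N-1} V_\alpha^2$, where $V_1,\dots,V_{N-1}$ are independent, each with distribution $N(0, \sigma_{11\cdot 2})$. Then $a_{11\cdot 2}/\sigma_{11\cdot 2}$ and $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ are distributed independently as $\chi^2$-variables with $N - p$ and $p - 1$ degrees of freedom, respectively. Thus

$$ F = \frac{R^2}{1 - R^2}\cdot\frac{N-p}{p-1} = \frac{a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}}{a_{11\cdot 2}/\sigma_{11\cdot 2}}\cdot\frac{N-p}{p-1} = \frac{\chi^2_{p-1}}{\chi^2_{N-p}}\cdot\frac{N-p}{p-1} = F_{p-1,\,N-p} \tag{19} $$

has the $F$-distribution with $p - 1$ and $N - p$ degrees of freedom. The density of $F$ is

$$ \frac{\Gamma[\frac12(N-1)]}{\Gamma[\frac12(p-1)]\,\Gamma[\frac12(N-p)]}\left(\frac{p-1}{N-p}\right)^{\frac12(p-1)} f^{\frac12(p-1)-1}\left(1 + \frac{p-1}{N-p}f\right)^{-\frac12(N-1)}. \tag{20} $$

Thus the density of

$$ R = \sqrt{\frac{\frac{p-1}{N-p}F}{1 + \frac{p-1}{N-p}F}} \tag{21} $$

is

$$ \frac{2\,\Gamma[\frac12(N-1)]}{\Gamma[\frac12(p-1)]\,\Gamma[\frac12(N-p)]}\,R^{p-2}(1 - R^2)^{\frac12(N-p)-1}, \qquad 0 \le R \le 1. \tag{22} $$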
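As a sanity check on the null density (22), read here as $2\Gamma[\frac12(N-1)]\,R^{p-2}(1-R^2)^{\frac12(N-p)-1}/\{\Gamma[\frac12(p-1)]\Gamma[\frac12(N-p)]\}$ on $0 \le R \le 1$, one can integrate it numerically. The sketch below uses a plain midpoint rule; the helper names and the choice $N = 20$ are illustrative. For $p = 3$ the upper tail has the closed form $\Pr\{R > r_0\} = (1 - r_0^2)^{\frac12(N-3)}$, which follows by direct integration and is used here as a cross-check.

```python
import math

def density_R(R, N, p):
    """Null density (22) of the sample multiple correlation R (population value 0)."""
    log_c = (math.log(2.0) + math.lgamma(0.5 * (N - 1))
             - math.lgamma(0.5 * (p - 1)) - math.lgamma(0.5 * (N - p)))
    return math.exp(log_c) * R ** (p - 2) * (1.0 - R * R) ** (0.5 * (N - p) - 1.0)

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule; never evaluates f at the endpoints."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

N = 20
total_p3 = integrate(lambda R: density_R(R, N, 3), 0.0, 1.0)   # should be close to 1
total_p5 = integrate(lambda R: density_R(R, N, 5), 0.0, 1.0)   # should be close to 1

# Tail probability beyond R = 0.802 for p = 3, numerically and in closed form.
tail_numeric = integrate(lambda R: density_R(R, N, 3), 0.802, 1.0)
tail_closed = (1.0 - 0.802 ** 2) ** (0.5 * (N - 3))            # valid for p = 3 only

print(total_p3, total_p5, tail_numeric, tail_closed)
```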


Theorem 4.4.2. Let $R$ be the sample multiple correlation coefficient [defined by (5)] between $X_1$ and $X^{(2)\prime} = (X_2,\dots,X_p)$ based on a sample of $N$ from $N(\mu, \Sigma)$. If $\bar R = 0$ [that is, if $(\sigma_{12},\dots,\sigma_{1p})' = 0 = \beta$], then $[R^2/(1 - R^2)]\cdot[(N-p)/(p-1)]$ is distributed as $F$ with $p - 1$ and $N - p$ degrees of freedom.

It should be noticed that $p - 1$ is the number of components of $X^{(2)}$ and that $N - p = N - (p - 1) - 1$. If the multiple correlation is between a component $X_i$ and $q$ other components, the numbers are $q$ and $N - q - 1$.

It might be observed that $R^2/(1 - R^2)$ is the quantity that arises in regression (or least squares) theory for testing the hypothesis that the regression of $X_1$ on $X_2,\dots,X_p$ is zero.

If $\bar R \ne 0$, the distribution of $R$ is much more difficult to derive. This distribution will be obtained in Section 4.4.3.

Now let us consider the statistical problem of testing the hypothesis $H: \bar R = 0$ on the basis of a sample of $N$ from $N(\mu, \Sigma)$. [$\bar R$ is the population multiple correlation between $X_1$ and $(X_2,\dots,X_p)$.] Since $\bar R \ge 0$, the alternatives considered are $\bar R > 0$.

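Let us derive the likelihood ratio test of this hypothesis. The likelihood function is

$$ L(\mu^*, \Sigma^*) = \frac{1}{(2\pi)^{\frac12 pN}|\Sigma^*|^{\frac12 N}}\exp\left[-\tfrac12\sum_{\alpha=1}^N (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*)\right]. \tag{23} $$

The observations are given; $L$ is a function of the indeterminates $\mu^*$, $\Sigma^*$. Let $\omega$ be the region in the parameter space $\Omega$ specified by the null hypothesis. The likelihood ratio criterion is

$$ \lambda = \frac{\max_{\mu^*,\Sigma^* \in \omega} L(\mu^*, \Sigma^*)}{\max_{\mu^*,\Sigma^* \in \Omega} L(\mu^*, \Sigma^*)}. \tag{24} $$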
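Here $\Omega$ is the space of $\mu^*$, $\Sigma^*$ positive definite, and $\omega$ is the region in this space where $\bar R = \sqrt{\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}/\sigma_{11}} = 0$, that is, where $\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)} = 0$. Because $\Sigma_{22}^{-1}$ is positive definite, this condition is equivalent to $\sigma_{(1)} = 0$. The maximum of $L(\mu^*, \Sigma^*)$ over $\Omega$ occurs at $\mu^* = \hat\mu = \bar x$ and $\Sigma^* = \hat\Sigma = (1/N)A = (1/N)\sum_{\alpha=1}^N (x_\alpha - \bar x)(x_\alpha - \bar x)'$ and is

$$ \max_{\mu^*,\Sigma^* \in \Omega} L(\mu^*, \Sigma^*) = \frac{N^{\frac12 pN}e^{-\frac12 pN}}{(2\pi)^{\frac12 pN}|A|^{\frac12 N}}. \tag{25} $$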
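In $\omega$ the likelihood function is

$$ L(\mu^*, \Sigma^* \mid \sigma^*_{(1)} = 0) = \frac{1}{(2\pi)^{\frac12 N}\sigma_{11}^{*\,\frac12 N}}\exp\left[-\tfrac12\sum_{\alpha=1}^N \frac{(x_{1\alpha} - \mu_1^*)^2}{\sigma_{11}^*}\right] \cdot \frac{1}{(2\pi)^{\frac12(p-1)N}|\Sigma_{22}^*|^{\frac12 N}}\exp\left[-\tfrac12\sum_{\alpha=1}^N (x_\alpha^{(2)} - \mu^{(2)*})'\Sigma_{22}^{*-1}(x_\alpha^{(2)} - \mu^{(2)*})\right]. \tag{26} $$

The first factor is maximized at $\mu_1^* = \hat\mu_1 = \bar x_1$ and $\sigma_{11}^* = \hat\sigma_{11} = (1/N)a_{11}$, and the second factor is maximized at $\mu^{(2)*} = \hat\mu^{(2)} = \bar x^{(2)}$ and $\Sigma_{22}^* = \hat\Sigma_{22} = (1/N)A_{22}$. The value of the maximized function is

$$ \max_{\mu^*,\Sigma^* \in \omega} L(\mu^*, \Sigma^*) = \frac{N^{\frac12 N}e^{-\frac12 N}}{(2\pi)^{\frac12 N}a_{11}^{\frac12 N}} \cdot \frac{N^{\frac12(p-1)N}e^{-\frac12(p-1)N}}{(2\pi)^{\frac12(p-1)N}|A_{22}|^{\frac12 N}}. \tag{27} $$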

Thus the likelihood ratio criterion is [see (6)]

$$ \lambda = \frac{|A|^{\frac12 N}}{a_{11}^{\frac12 N}|A_{22}|^{\frac12 N}} = (1 - R^2)^{\frac12 N}. \tag{28} $$

The likelihood ratio test consists of the critical region $\lambda < \lambda_0$, where $\lambda_0$ is chosen so the probability of this inequality when $\bar R = 0$ is the significance level $\alpha$. An equivalent test is

$$ R^2 = 1 - \lambda^{2/N} > 1 - \lambda_0^{2/N}. \tag{29} $$

Since $[R^2/(1 - R^2)][(N-p)/(p-1)]$ is a monotonic function of $R$, an equivalent test involves this ratio being larger than a constant. When $\bar R = 0$, this ratio has an $F_{p-1,\,N-p}$-distribution. Hence, the critical region is

$$ \frac{R^2}{1 - R^2}\cdot\frac{N-p}{p-1} > F_{p-1,\,N-p}(\alpha), \tag{30} $$

where $F_{p-1,\,N-p}(\alpha)$ is the (upper) significance point corresponding to the $\alpha$ significance level.




Theorem 4.4.3. Given a sample $x_1,\dots,x_N$ from $N(\mu, \Sigma)$, the likelihood ratio test at significance level $\alpha$ for the hypothesis $\bar R = 0$, where $\bar R$ is the population multiple correlation coefficient between $X_1$ and $(X_2,\dots,X_p)$, is given by (30), where $R$ is the sample multiple correlation coefficient defined by (5).


As an example consider the data given at the end of Section 4.3.1. The sample multiple correlation coefficient is found from

$$ 1 - R^2 = \frac{\begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix}} = \frac{\begin{vmatrix} 1.00 & 0.80 & -0.40 \\ 0.80 & 1.00 & -0.56 \\ -0.40 & -0.56 & 1.00 \end{vmatrix}}{\begin{vmatrix} 1.00 & -0.56 \\ -0.56 & 1.00 \end{vmatrix}} = 0.357. \tag{31} $$

Thus $R$ is 0.802. If we wish to test the hypothesis at the 0.01 level that hay yield is independent of spring rainfall and temperature, we compare the observed $[R^2/(1 - R^2)][(20 - 3)/(3 - 1)] = 15.3$ with $F_{2,17}(0.01) = 6.11$ and find the result significant; that is, we reject the null hypothesis.
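The arithmetic of this example can be reproduced with a short script. The correlations and $N = 20$, $p = 3$ are those of the example; the function and variable names are illustrative. Because the numerator has $p - 1 = 2$ degrees of freedom, the sketch uses the closed form $\Pr\{F_{2,m} > f\} = (1 + 2f/m)^{-m/2}$ for the tail area and its inverse for the significance point, a shortcut that applies only to this special case; in general a table or a statistical library would be used.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

corr = [[1.00, 0.80, -0.40],
        [0.80, 1.00, -0.56],
        [-0.40, -0.56, 1.00]]
N, p = 20, 3

det_22 = corr[1][1] * corr[2][2] - corr[1][2] * corr[2][1]
one_minus_r2 = det3(corr) / (corr[0][0] * det_22)       # equation (31)
r2 = 1.0 - one_minus_r2
f_stat = r2 / (1.0 - r2) * (N - p) / (p - 1)            # left side of (30)

m = N - p                                               # denominator degrees of freedom
alpha = 0.01
f_crit = 0.5 * m * (alpha ** (-2.0 / m) - 1.0)          # upper point of F_{2,m}
p_value = (1.0 + 2.0 * f_stat / m) ** (-0.5 * m)        # tail area of F_{2,m}

print(round(one_minus_r2, 3), round(math.sqrt(r2), 3), round(f_stat, 1), round(f_crit, 2))
```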

The test of independence between $X_1$ and $(X_2,\dots,X_p) = X^{(2)\prime}$ is equivalent to the test that if the regression of $X_1$ on $x^{(2)}$ (that is, the conditional expected value of $X_1$ given $X_2 = x_2,\dots,X_p = x_p$) is $\mu_1 + \beta'(x^{(2)} - \mu^{(2)})$, the vector of regression coefficients is 0. Here $\hat\beta = A_{22}^{-1}a_{(1)}$ is the usual least squares estimate of $\beta$ with expected value $\beta$ and covariance matrix $\sigma_{11\cdot 2}A_{22}^{-1}$ (when the $X_\alpha^{(2)}$ are fixed), and $a_{11\cdot 2}/(N - p)$ is the usual estimate of $\sigma_{11\cdot 2}$. Thus [see (18)]

$$ \frac{R^2}{1 - R^2}\cdot\frac{N-p}{p-1} = \frac{\hat\beta'A_{22}\hat\beta}{a_{11\cdot 2}}\cdot\frac{N-p}{p-1} \tag{32} $$

is the usual $F$-statistic for testing the hypothesis that the regression of $X_1$ on $x_2,\dots,x_p$ is 0. In this book we are primarily interested in the multiple
correlation coefficient as a measure of association between one variable and 
a vector of variables when both are random. We shall not treat problems of 
univariate regression. In Chapter 8 we study regression when the dependent 
variable is a vector. 


Adjusted Multiple Correlation Coefficient 

The expression (17) is the ratio of $a_{11\cdot 2}$, the sum of squared deviations from the fitted regression, to $a_{11}$, the sum of squared deviations around the mean. To obtain unbiased estimators of $\sigma_{11}$ when $\beta = 0$ we would divide these quantities by their numbers of degrees of freedom, $N - p$ and $N - 1$, respectively. Accordingly we can define an adjusted multiple correlation coefficient $R^*$ by

$$ 1 - R^{*2} = \frac{a_{11\cdot 2}/(N-p)}{a_{11}/(N-1)} = \frac{N-1}{N-p}(1 - R^2), \tag{33} $$

which is equivalent to

$$ R^{*2} = R^2 - \frac{p-1}{N-p}(1 - R^2). \tag{34} $$

This quantity is smaller than $R^2$ (unless $p = 1$ or $R^2 = 1$). A possible merit to it is that it takes account of $p$; the idea is that the larger $p$ is relative to $N$, the greater the tendency of $R^2$ to be large by chance.
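A minimal sketch of the adjustment, assuming the relation $R^{*2} = R^2 - \frac{p-1}{N-p}(1 - R^2)$ of (34); the function name is illustrative. The first call uses $R^2 \approx 0.6434$, $N = 20$, $p = 3$ from the hay-yield example; the second shows that the adjusted value can be negative when $R^2$ is small.

```python
def adjusted_r2(r2, N, p):
    """Adjusted squared multiple correlation, following (34)."""
    return r2 - (p - 1) / (N - p) * (1.0 - r2)

adj_example = adjusted_r2(0.6434, 20, 3)   # a little smaller than 0.6434
adj_small = adjusted_r2(0.05, 20, 3)       # negative: R^2 below its chance level
print(round(adj_example, 4), round(adj_small, 4))
```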


4.4.3. Distribution of the Sample Multiple Correlation Coefficient When the
Population Multiple Correlation Coefficient Is Not Zero


In this subsection we shall find the distribution of $R$ when the null hypothesis $\bar R = 0$ is not true. We shall find that the distribution depends only on the population multiple correlation coefficient $\bar R$.

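First let us consider the conditional distribution of $R^2/(1 - R^2) = a_{(1)}'A_{22}^{-1}a_{(1)}/a_{11\cdot 2}$ given $Z_\alpha^{(2)} = z_\alpha^{(2)}$, $\alpha = 1,\dots,n$. Under these conditions $Z_{11},\dots,Z_{1n}$ are independently distributed, $Z_{1\alpha}$ according to $N(\beta'z_\alpha^{(2)}, \sigma_{11\cdot 2})$, where $\beta = \Sigma_{22}^{-1}\sigma_{(1)}$ and $\sigma_{11\cdot 2} = \sigma_{11} - \sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}$. The conditions are those of Theorem 4.3.3 with $Y_\alpha = Z_{1\alpha}$, $\Gamma = \beta'$, $w_\alpha = z_\alpha^{(2)}$, $r = p - 1$, $\Phi = \sigma_{11\cdot 2}$, $m = n$. Then $a_{11\cdot 2} = a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)}$ corresponds to $\sum_{\alpha=1}^m Y_\alpha Y_\alpha' - GHG'$, and $a_{11\cdot 2}/\sigma_{11\cdot 2}$ has a $\chi^2$-distribution with $n - (p - 1)$ degrees of freedom. $a_{(1)}'A_{22}^{-1}a_{(1)} = (A_{22}^{-1}a_{(1)})'A_{22}(A_{22}^{-1}a_{(1)})$ corresponds to $GHG'$ and is distributed as $\sum_\alpha U_\alpha^2$, $\alpha = n - (p - 1) + 1,\dots,n$, where $\operatorname{Var}(U_\alpha) = \sigma_{11\cdot 2}$ and

$$ E(U_{n-p+2},\dots,U_n) = \Gamma F^{-1}, \tag{35} $$

where $FHF' = I$ [$H = F^{-1}(F')^{-1}$]. Then $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ is distributed as $\sum_\alpha (U_\alpha/\sqrt{\sigma_{11\cdot 2}})^2$, where $\operatorname{Var}(U_\alpha/\sqrt{\sigma_{11\cdot 2}}) = 1$ and

$$ \sum_{\alpha=n-p+2}^{n}\left(\frac{E\,U_\alpha}{\sqrt{\sigma_{11\cdot 2}}}\right)^2 = \frac{1}{\sigma_{11\cdot 2}}\,\Gamma F^{-1}(\Gamma F^{-1})' = \frac{\Gamma H\Gamma'}{\sigma_{11\cdot 2}} = \frac{\beta'A_{22}\beta}{\sigma_{11\cdot 2}}. \tag{36} $$

Thus (conditionally) $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ has a noncentral $\chi^2$-distribution with $p - 1$ degrees of freedom and noncentrality parameter $\beta'A_{22}\beta/\sigma_{11\cdot 2}$. (See Theorem 5.4.1.) We are led to the following theorem: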


Theorem 4.4.4. Let $R$ be the sample multiple correlation coefficient between $X_1$ and $X^{(2)\prime} = (X_2,\dots,X_p)$ based on $N$ observations $(x_{11}, x_1^{(2)\prime}),\dots,(x_{1N}, x_N^{(2)\prime})$. The conditional distribution of $[R^2/(1 - R^2)][(N-p)/(p-1)]$ given $x_\alpha^{(2)}$ fixed is noncentral $F$ with $p - 1$ and $N - p$ degrees of freedom and noncentrality parameter $\beta'A_{22}\beta/\sigma_{11\cdot 2}$.


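The conditional density (from Theorem 5.4.1) of $F = [R^2/(1 - R^2)][(N-p)/(p-1)]$ is

$$ \frac{(p-1)\exp\left[-\frac12\beta'A_{22}\beta/\sigma_{11\cdot 2}\right]}{(N-p)\,\Gamma[\frac12(N-p)]}\sum_{\alpha=0}^\infty \frac{\left(\dfrac{\beta'A_{22}\beta}{2\sigma_{11\cdot 2}}\right)^{\alpha}\left[\dfrac{(p-1)f}{N-p}\right]^{\frac12(p-1)+\alpha-1}\Gamma[\frac12(N-1)+\alpha]}{\alpha!\,\Gamma[\frac12(p-1)+\alpha]\left[1 + \dfrac{(p-1)f}{N-p}\right]^{\frac12(N-1)+\alpha}}, \tag{37} $$

and the conditional density of $W = R^2$ is [$df = ((N-p)/(p-1))(1 - w)^{-2}\,dw$]

$$ \frac{\exp\left[-\frac12\beta'A_{22}\beta/\sigma_{11\cdot 2}\right]}{\Gamma[\frac12(N-p)]}\,w^{\frac12(p-1)-1}(1 - w)^{\frac12(N-p)-1}\sum_{\alpha=0}^\infty \frac{\left(\dfrac{\beta'A_{22}\beta}{2\sigma_{11\cdot 2}}\right)^{\alpha} w^\alpha\,\Gamma[\frac12(N-1)+\alpha]}{\alpha!\,\Gamma[\frac12(p-1)+\alpha]}. \tag{38} $$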

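To obtain the unconditional density we need to multiply (38) by the density of $Z_1^{(2)},\dots,Z_n^{(2)}$ to obtain the joint density of $W$ and $Z_1^{(2)},\dots,Z_n^{(2)}$ and then integrate with respect to the latter set to obtain the marginal density of $W$. We have

$$ \frac{\beta'A_{22}\beta}{\sigma_{11\cdot 2}} = \frac{\beta'\sum_{\alpha=1}^n Z_\alpha^{(2)}Z_\alpha^{(2)\prime}\beta}{\sigma_{11\cdot 2}} = \sum_{\alpha=1}^n\left(\frac{\beta'Z_\alpha^{(2)}}{\sqrt{\sigma_{11\cdot 2}}}\right)^2. \tag{39} $$

Since the distribution of $Z_\alpha^{(2)}$ is $N(0, \Sigma_{22})$, the distribution of $\beta'Z_\alpha^{(2)}/\sqrt{\sigma_{11\cdot 2}}$ is normal with mean zero and variance

$$ E\left(\frac{\beta'Z_\alpha^{(2)}}{\sqrt{\sigma_{11\cdot 2}}}\right)^2 = \frac{E\,\beta'Z_\alpha^{(2)}Z_\alpha^{(2)\prime}\beta}{\sigma_{11\cdot 2}} = \frac{\beta'\Sigma_{22}\beta}{\sigma_{11} - \beta'\Sigma_{22}\beta} = \frac{\beta'\Sigma_{22}\beta/\sigma_{11}}{1 - \beta'\Sigma_{22}\beta/\sigma_{11}} = \frac{\bar R^2}{1 - \bar R^2}. \tag{40} $$

Thus $(\beta'A_{22}\beta/\sigma_{11\cdot 2})/[\bar R^2/(1 - \bar R^2)]$ has a $\chi^2$-distribution with $n$ degrees of freedom. Let $\bar R^2/(1 - \bar R^2) = \phi$. Then $\beta'A_{22}\beta/\sigma_{11\cdot 2} = \phi\chi_n^2$. We compute

$$ \begin{aligned} E\,e^{-\frac12\phi\chi_n^2}\left(\frac{\phi\chi_n^2}{2}\right)^{\alpha} &= \phi^\alpha\int_0^\infty \frac{u^\alpha}{2^\alpha}\,e^{-\frac12\phi u}\,\frac{1}{2^{\frac12 n}\Gamma(\frac12 n)}\,u^{\frac12 n-1}e^{-\frac12 u}\,du \\ &= \frac{\phi^\alpha}{2^{\frac12 n+\alpha}\,\Gamma(\frac12 n)}\int_0^\infty u^{\frac12 n+\alpha-1}e^{-\frac12(1+\phi)u}\,du \\ &= \frac{\phi^\alpha}{(1+\phi)^{\frac12 n+\alpha}}\,\frac{\Gamma(\frac12 n+\alpha)}{\Gamma(\frac12 n)}\int_0^\infty \frac{1}{2^{\frac12 n+\alpha}\,\Gamma(\frac12 n+\alpha)}\,v^{\frac12 n+\alpha-1}e^{-\frac12 v}\,dv \\ &= \frac{\phi^\alpha}{(1+\phi)^{\frac12 n+\alpha}}\,\frac{\Gamma(\frac12 n+\alpha)}{\Gamma(\frac12 n)}. \end{aligned} \tag{41} $$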

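Applying this result to (38), we obtain as the density of $R^2$

$$ \frac{(1 - R^2)^{\frac12(n-p+1)-1}(1 - \bar R^2)^{\frac12 n}}{\Gamma[\frac12(n-p+1)]\,\Gamma(\frac12 n)}\sum_{\mu=0}^\infty \frac{(\bar R^2)^\mu (R^2)^{\frac12(p-1)+\mu-1}\,\Gamma^2(\frac12 n+\mu)}{\mu!\,\Gamma[\frac12(p-1)+\mu]}. \tag{42} $$

Fisher (1928) found this distribution. It can also be written

$$ \frac{\Gamma(\frac12 n)}{\Gamma[\frac12(n-p+1)]\,\Gamma[\frac12(p-1)]}\,(1 - \bar R^2)^{\frac12 n}(R^2)^{\frac12(p-3)}(1 - R^2)^{\frac12(n-p-1)}\,F[\tfrac12 n, \tfrac12 n; \tfrac12(p-1); \bar R^2R^2], \tag{43} $$

where $F$ is the hypergeometric function defined in (41) of Section 4.2.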
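Another form of the density can be obtained when $n - p + 1$ is even. We have

$$ \begin{aligned} \sum_{\mu=0}^\infty \frac{(\bar R^2R^2)^\mu\,\Gamma^2(\frac12 n+\mu)}{\mu!\,\Gamma[\frac12(p-1)+\mu]} &= \sum_{\mu=0}^\infty \frac{(\bar R^2R^2)^\mu\,\Gamma(\frac12 n+\mu)}{\mu!}\left(\frac{\partial}{\partial t}\right)^{\frac12(n-p+1)} t^{\frac12 n+\mu-1}\Bigg|_{t=1} \\ &= \left(\frac{\partial}{\partial t}\right)^{\frac12(n-p+1)} t^{\frac12 n-1}\sum_{\mu=0}^\infty \frac{(t\bar R^2R^2)^\mu\,\Gamma(\frac12 n+\mu)}{\mu!}\Bigg|_{t=1} \\ &= \Gamma(\tfrac12 n)\left(\frac{\partial}{\partial t}\right)^{\frac12(n-p+1)} t^{\frac12 n-1}(1 - t\bar R^2R^2)^{-\frac12 n}\Bigg|_{t=1}. \end{aligned} \tag{44} $$

The density is therefore

$$ \frac{(1 - \bar R^2)^{\frac12 n}(R^2)^{\frac12(p-3)}(1 - R^2)^{\frac12(n-p-1)}}{\Gamma[\frac12(n-p+1)]}\left(\frac{\partial}{\partial t}\right)^{\frac12(n-p+1)} t^{\frac12 n-1}(1 - t\bar R^2R^2)^{-\frac12 n}\Bigg|_{t=1}. \tag{45} $$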


Theorem 4.4.5. The density of the square of the multiple correlation coefficient, $R^2$, between $X_1$ and $X_2,\dots,X_p$ based on a sample of $N = n + 1$ is given by (42) or (43) [or (45) in the case of $n - p + 1$ even], where $\bar R^2$ is the corresponding population multiple correlation coefficient.
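The series form of the density lends itself to a numerical check. The sketch below reads (42) as $(1-w)^{\frac12(n-p+1)-1}(1-\bar R^2)^{\frac12 n}/\{\Gamma[\frac12(n-p+1)]\Gamma(\frac12 n)\}$ times $\sum_\mu (\bar R^2)^\mu w^{\frac12(p-1)+\mu-1}\Gamma^2(\frac12 n+\mu)/\{\mu!\,\Gamma[\frac12(p-1)+\mu]\}$, with $w = R^2$, truncates the series, and integrates by the midpoint rule. The truncation length, step count, and the values $n = 19$, $p = 3$, $\bar R^2 = 0.5$ are arbitrary illustrative choices.

```python
import math

def density_r2(w, rho2, n, p, terms=120):
    """Truncated series (42) for the density of W = R^2 at w in (0, 1).

    rho2 is the squared population multiple correlation, 0 <= rho2 < 1.
    """
    a = 0.5 * (p - 1)
    b = 0.5 * (n - p + 1)
    log_front = ((b - 1.0) * math.log1p(-w) + 0.5 * n * math.log1p(-rho2)
                 - math.lgamma(b) - math.lgamma(0.5 * n))
    total = 0.0
    for mu in range(terms):
        if mu > 0 and rho2 == 0.0:
            break                               # only the mu = 0 term survives
        log_term = (2.0 * math.lgamma(0.5 * n + mu) - math.lgamma(mu + 1.0)
                    - math.lgamma(a + mu) + (a + mu - 1.0) * math.log(w))
        if mu > 0:
            log_term += mu * math.log(rho2)
        total += math.exp(log_front + log_term)
    return total

def integrate01(f, steps=1000):
    """Midpoint rule on (0, 1)."""
    h = 1.0 / steps
    return h * sum(f((k + 0.5) * h) for k in range(steps))

n, p = 19, 3
mass = integrate01(lambda w: density_r2(w, 0.5, n, p))    # should be close to 1
null_value = density_r2(0.5, 0.0, n, p)                   # reduces to a beta density
print(mass, null_value)
```

With $\bar R^2 = 0$ only the first term remains and the expression reduces to the beta density with parameters $\frac12(p-1)$ and $\frac12(n-p+1)$, which for $n = 19$, $p = 3$ is $8.5(1-w)^{7.5}$.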


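The moments of $R$ are

$$ \begin{aligned} E\,R^h &= \frac{(1 - \bar R^2)^{\frac12 n}}{\Gamma[\frac12(n-p+1)]\,\Gamma(\frac12 n)}\sum_{\mu=0}^\infty \frac{(\bar R^2)^\mu\,\Gamma^2(\frac12 n+\mu)}{\mu!\,\Gamma[\frac12(p-1)+\mu]}\int_0^1 (1 - R^2)^{\frac12(n-p+1)-1}(R^2)^{\frac12(p+h-1)+\mu-1}\,d(R^2) \\ &= \frac{(1 - \bar R^2)^{\frac12 n}}{\Gamma(\frac12 n)}\sum_{\mu=0}^\infty \frac{(\bar R^2)^\mu\,\Gamma^2(\frac12 n+\mu)\,\Gamma[\frac12(p+h-1)+\mu]}{\mu!\,\Gamma[\frac12(p-1)+\mu]\,\Gamma[\frac12(n+h)+\mu]}. \end{aligned} \tag{46} $$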


The sample multiple correlation tends to overestimate the population multiple correlation. The sample multiple correlation is the maximum sample correlation between $x_1$ and linear combinations of $x^{(2)}$ and hence is greater than the sample correlation between $x_1$ and $\beta'x^{(2)}$; however, the latter is the simple sample correlation corresponding to the simple population correlation between $x_1$ and $\beta'x^{(2)}$, which is $\bar R$, the population multiple correlation.

Suppose $R_1$ is the multiple correlation in the first of two samples and $\hat\beta_1$ is the estimate of $\beta$; then the simple correlation between $x_1$ and $\hat\beta_1'x^{(2)}$ in the second sample will tend to be less than $R_1$ and in particular will be less than $R_2$, the multiple correlation in the second sample. This has been called "the shrinkage of the multiple correlation."
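The moment series gives a quick numerical look at this tendency. The sketch below evaluates $E\,R^h = (1-\bar R^2)^{\frac12 n}\,\Gamma(\frac12 n)^{-1}\sum_\mu (\bar R^2)^\mu\,\Gamma^2(\frac12 n+\mu)\,\Gamma[\frac12(p+h-1)+\mu]/\{\mu!\,\Gamma[\frac12(p-1)+\mu]\,\Gamma[\frac12(n+h)+\mu]\}$, the second line of (46), truncated after a fixed number of terms. The function name, the truncation length, and the configuration $n = 19$, $p = 3$ are illustrative assumptions.

```python
import math

def moment_R(h, rho2, n, p, terms=400):
    """E R^h from the series (46); rho2 is the squared population multiple correlation."""
    log_front = 0.5 * n * math.log1p(-rho2) - math.lgamma(0.5 * n)
    total = 0.0
    for mu in range(terms):
        if mu > 0 and rho2 == 0.0:
            break                               # only the mu = 0 term survives
        log_term = (2.0 * math.lgamma(0.5 * n + mu)
                    + math.lgamma(0.5 * (p + h - 1) + mu)
                    - math.lgamma(mu + 1.0)
                    - math.lgamma(0.5 * (p - 1) + mu)
                    - math.lgamma(0.5 * (n + h) + mu))
        if mu > 0:
            log_term += mu * math.log(rho2)
        total += math.exp(log_front + log_term)
    return total

n, p = 19, 3
check_h0 = moment_R(0, 0.5, n, p)     # h = 0 must return 1
mean_r2_null = moment_R(2, 0.0, n, p) # equals (p - 1) / n when the population value is 0
mean_r2 = moment_R(2, 0.5, n, p)      # expected R^2 when the population R^2 is 0.5
print(check_h0, mean_r2_null, mean_r2)
```

In this configuration the expected value of $R^2$ exceeds the population value 0.5, and even when the population value is 0 the expectation is $(p-1)/n = 2/19$.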

Kramer (1963) and Lee (1972) have given tables of the upper significance points of $R$. Gajjar (1967), Gurland (1968), Gurland and Milton (1970), Khatri (1966), and Lee (1971b) have suggested approximations to the distributions of $R^2/(1 - R^2)$ and obtained large-sample results.


4.4.4. Some Optimal Properties of the Multiple Correlation Test 


Theorem 4.4.6. Given the observations $x_1,\dots,x_N$ from $N(\mu, \Sigma)$, of all tests of $\bar R = 0$ at a given significance level based on $\bar x$ and $A = \sum_{\alpha=1}^N (x_\alpha - \bar x)(x_\alpha - \bar x)'$ that are invariant with respect to transformations

$$ \begin{gathered} \bar x_1^* = c\bar x_1 + d, \qquad \bar x^{(2)*} = C\bar x^{(2)} + d, \\ a_{11}^* = c^2a_{11}, \qquad a_{(1)}^* = cCa_{(1)}, \qquad A_{22}^* = CA_{22}C', \end{gathered} \tag{47} $$

any critical rejection region given by $R$ greater than a constant is uniformly most powerful.

Proof. The multiple correlation coefficient $R$ is invariant under the transformation, and any function of the sufficient statistics that is invariant is a function of $R$. (See Problem 4.34.) Therefore, any invariant test must be based on $R$. The Neyman–Pearson fundamental lemma applied to testing the null hypothesis $\bar R = 0$ against a specific alternative $\bar R = \bar R_0 > 0$ tells us the most powerful test at a given level of significance is based on the ratio of the density of $R$ for $\bar R = \bar R_0$, which is (42) times $2R$ [because (42) is the density of $R^2$], to the density for $\bar R = 0$, which is (22). The ratio is a positive constant times

$$ \sum_{\mu=0}^\infty \frac{(\bar R_0^2)^\mu (R^2)^\mu\,\Gamma^2(\frac12 n+\mu)}{\mu!\,\Gamma[\frac12(p-1)+\mu]}. \tag{48} $$

Since (48) is an increasing function of $R$ for $R \ge 0$, the set of $R$ for which (48) is greater than a constant is an interval of $R$ greater than a constant. ∎
Theorem 4.4.7. On the basis of observations $x_1,\dots,x_N$ from $N(\mu, \Sigma)$, of all tests of $\bar R = 0$ at a given significance level with power depending only on $\bar R$, the test with critical region given by $R$ greater than a constant is uniformly most powerful.


Theorem 4.4.7 follows from Theorem 4.4.6 in the same way that Theorem 
5.6.4 follows from Theorem 5.6.1. 


4.5. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


4.5.1. Observations Elliptically Contoured 


Suppose $x_1,\dots,x_N$ are $N$ independent observations on a random $p$-vector $X$ with density

$$ |\Lambda|^{-\frac12}\,g[(x - \nu)'\Lambda^{-1}(x - \nu)]. \tag{1} $$

The sample covariance matrix $S$ is an unbiased estimator of the covariance matrix $\Sigma = [E R^2/p]\Lambda$, where $R^2 = (X - \nu)'\Lambda^{-1}(X - \nu)$ and $E R^2 < \infty$. An estimator of $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}}$ is $r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$, $i, j = 1,\dots,p$. The small-sample distribution of $r_{ij}$ is in general difficult to obtain, but the asymptotic distribution can be obtained from the limiting normal distribution of $\sqrt N(S - \Sigma)$ given in (13) of Section 3.6.

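First we prove a general theorem on asymptotic distributions of functions of the sample covariance matrix $S$ using Theorems 4.2.3 and 3.6.5. Define

$$ s = \operatorname{vec} S, \qquad \sigma = \operatorname{vec}\Sigma. \tag{2} $$

Theorem 4.5.1. Let $f(s)$ be a vector-valued function such that each component of $f(s)$ has a nonzero differential at $s = \sigma$. Suppose $S$ is the covariance of a sample from (1) such that $E R^4 < \infty$. Then

$$ \sqrt N[f(s) - f(\sigma)] = \frac{\partial f(\sigma)}{\partial\sigma'}\sqrt N(s - \sigma) + o_p(1) \xrightarrow{d} N\left\{0,\ \frac{\partial f(\sigma)}{\partial\sigma'}\left[2(1+\kappa)(\Sigma\otimes\Sigma) + \kappa\sigma\sigma'\right]\left(\frac{\partial f(\sigma)}{\partial\sigma'}\right)'\right\}. \tag{3} $$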


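Corollary 4.5.1. If

$$ f(cs) = f(s) \tag{4} $$

for all $c > 0$ and all positive definite $S$ and the conditions of Theorem 4.5.1 hold, then

$$ \sqrt N[f(s) - f(\sigma)] \xrightarrow{d} N\left[0,\ 2(1+\kappa)\frac{\partial f(\sigma)}{\partial\sigma'}(\Sigma\otimes\Sigma)\left(\frac{\partial f(\sigma)}{\partial\sigma'}\right)'\right]. \tag{5} $$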
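Proof. From (4) we deduce

$$ 0 = \frac{\partial f(cs)}{\partial c} = \frac{\partial f(cs)}{\partial(cs)'}\,\frac{\partial(cs)}{\partial c} = \frac{\partial f(cs)}{\partial(cs)'}\,s. \tag{6} $$

That is,

$$ \frac{\partial f(\sigma)}{\partial\sigma'}\,\sigma = 0. \tag{7} $$

∎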


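The conclusion of Corollary 4.5.1 can be framed as

$$ \sqrt{\frac{N}{1+\kappa}}\,[f(s) - f(\sigma)] \xrightarrow{d} N\left[0,\ 2\frac{\partial f(\sigma)}{\partial\sigma'}(\Sigma\otimes\Sigma)\left(\frac{\partial f(\sigma)}{\partial\sigma'}\right)'\right]. \tag{8} $$

The limiting normal distribution in (8) holds in particular when the sample is drawn from the normal distribution. The corollary holds true if $\kappa$ is replaced by a consistent estimator $\hat\kappa$. For example, a consistent estimator of $1 + \kappa$ given by (16) of Section 3.6 is

$$ 1 + \hat\kappa = \frac{1}{N}\sum_{\alpha=1}^N \frac{[(x_\alpha - \bar x)'S^{-1}(x_\alpha - \bar x)]^2}{p(p+2)}. \tag{9} $$

A sample correlation such as $f(s) = r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$ or a set of such correlations is a function of $S$ that is invariant under scale transformations; that is, it satisfies (4).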
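A minimal sketch of the kurtosis estimator for bivariate data ($p = 2$), reading (9) as the average of the squared quadratic forms $(x_\alpha - \bar x)'S^{-1}(x_\alpha - \bar x)$ divided by $p(p+2)$; that reading, the use of divisor $N - 1$ in $S$, the function name, and the data are all assumptions made for illustration. A convenient check is that the quadratic forms themselves sum to $(N-1)p$, since their sum is $\operatorname{tr} S^{-1}A$.

```python
def kappa_plus_one_hat(xs, ys):
    """Return (1 + kappa_hat, quadratic forms) for bivariate data, p = 2."""
    N, p = len(xs), 2
    mx, my = sum(xs) / N, sum(ys) / N
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    # Sample covariance matrix S with divisor N - 1.
    s11 = sum(a * a for a in dx) / (N - 1)
    s22 = sum(b * b for b in dy) / (N - 1)
    s12 = sum(a * b for a, b in zip(dx, dy)) / (N - 1)
    det = s11 * s22 - s12 * s12
    # d_a = (x_a - xbar)' S^{-1} (x_a - xbar), using the explicit 2x2 inverse.
    d = [(s22 * a * a - 2.0 * s12 * a * b + s11 * b * b) / det
         for a, b in zip(dx, dy)]
    return sum(t * t for t in d) / (N * p * (p + 2)), d

# Hypothetical sample of N = 8 bivariate observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 9.0]
one_plus_kappa, quad_forms = kappa_plus_one_hat(xs, ys)
print(one_plus_kappa, sum(quad_forms))
```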


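Corollary 4.5.2. Under the conditions of Theorem 4.5.1,

$$ \sqrt{\frac{N}{1+\kappa}}\;\frac{r_{ij} - \rho_{ij}}{1 - \rho_{ij}^2} \xrightarrow{d} N(0, 1). \tag{10} $$

As in the case of the observations normally distributed,

$$ \sqrt{\frac{N}{1+\kappa}}\left[\tfrac12\log\frac{1 + r_{ij}}{1 - r_{ij}} - \tfrac12\log\frac{1 + \rho_{ij}}{1 - \rho_{ij}}\right] \xrightarrow{d} N(0, 1). \tag{11} $$

Of course, any improvement of (11) over (10) depends on the distribution sampled.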
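The limit in (11), namely that $\sqrt{N/(1+\kappa)}$ times the difference of the Fisher transforms of $r_{ij}$ and $\rho_{ij}$ is asymptotically standard normal, suggests an approximate confidence interval for $\rho_{ij}$ whose width grows with the kurtosis parameter. The sketch below is one way to write it; the function name and the inputs $r = 0.6$, $N = 100$, $\kappa = 0.5$ are hypothetical, and in practice $\kappa$ would be replaced by a consistent estimate.

```python
import math

def z_interval(r, n_obs, kappa=0.0, z_crit=1.96):
    """Approximate interval for rho from (11); kappa = 0 is the normal case."""
    half_width = z_crit * math.sqrt((1.0 + kappa) / n_obs)
    z = math.atanh(r)                     # (1/2) log[(1 + r) / (1 - r)]
    return math.tanh(z - half_width), math.tanh(z + half_width)

lo_normal, hi_normal = z_interval(0.6, 100)             # kappa = 0
lo_heavy, hi_heavy = z_interval(0.6, 100, kappa=0.5)    # heavier tails: wider interval
print((lo_normal, hi_normal), (lo_heavy, hi_heavy))
```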


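Partial correlations such as $r_{ij\cdot q+1,\dots,p}$, $i, j = 1,\dots,q$, are also invariant functions of $S$.

Corollary 4.5.3. Under the conditions of Theorem 4.5.1,

$$ \sqrt{\frac{N}{1+\kappa}}\;\frac{r_{ij\cdot q+1,\dots,p} - \rho_{ij\cdot q+1,\dots,p}}{1 - \rho^2_{ij\cdot q+1,\dots,p}} \xrightarrow{d} N(0, 1). \tag{12} $$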
Now let us consider the asymptotic distribution of $R^2$, the square of the multiple correlation, when $\bar R^2$, the square of the population multiple correlation, is 0. We use the notation of Section 4.4. $\bar R^2 = 0$ is equivalent to $\sigma_{(1)} = 0$. Since the sample and population multiple correlation coefficients between $X_1$ and $X^{(2)} = (X_2,\dots,X_p)'$ are invariant with respect to linear transformations (47) of Section 4.4, for purposes of studying the distribution of $R^2$ we can assume $\mu = 0$ and $\Sigma = I_p$. In that case $s_{11} \xrightarrow{p} 1$, $s_{(1)} \xrightarrow{p} 0$, and $S_{22} \xrightarrow{p} I_{p-1}$. Furthermore, for $k, i \ne 1$ and $j = l = 1$, Lemma 3.6.1 gives

$$ E\,s_{(1)}s_{(1)}' = \left(\frac{1}{n} + \frac{\kappa}{N}\right)I_{p-1}. \tag{13} $$

Theorem 4.5.2. Under the conditions of Theorem 4.5.1

$$ \sqrt{\frac{N}{1+\kappa}}\;s_{(1)} \xrightarrow{d} N(0, I_{p-1}). \tag{14} $$

Corollary 4.5.4. Under the conditions of Theorem 4.5.1

$$ \frac{NR^2}{1+\kappa} = \frac{N\,s_{(1)}'S_{22}^{-1}s_{(1)}}{(1+\kappa)\,s_{11}} \xrightarrow{d} \chi^2_{p-1}. \tag{15} $$
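The limit in (15), $NR^2/(1+\kappa) \to \chi^2_{p-1}$ in distribution, gives a large-sample test of zero multiple correlation that allows for nonnormal kurtosis. The sketch below treats only $p = 3$, where the limiting $\chi^2_2$ distribution has the closed-form tail area $e^{-x/2}$; the function name and the inputs (the values $R^2 = 0.6434$, $N = 20$ and a hypothetical $\kappa$) are illustrative, and with $N$ this small the asymptotic approximation is rough.

```python
import math

def asymptotic_test_p3(r2, n_obs, kappa):
    """Return (N R^2 / (1 + kappa), approximate tail area) for p = 3."""
    stat = n_obs * r2 / (1.0 + kappa)       # left side of (15)
    return stat, math.exp(-0.5 * stat)      # P(chi-square with 2 d.f. > stat)

stat_normal, p_normal = asymptotic_test_p3(0.6434, 20, kappa=0.0)
stat_heavy, p_heavy = asymptotic_test_p3(0.6434, 20, kappa=1.0)   # hypothetical kappa
print(stat_normal, p_normal, stat_heavy, p_heavy)
```

A positive $\kappa$ shrinks the statistic and so makes the test more conservative than the normal-theory version.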


4.5.2. Elliptically Contoured Matrix Distributions 


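Now let us turn to the model

$$ |\Lambda|^{-N/2}\,g[\operatorname{tr}(X - \varepsilon_N\nu')\Lambda^{-1}(X - \varepsilon_N\nu')'] \tag{16} $$

based on the vector spherical model $g(\operatorname{tr} Y'Y)$. The unbiased estimators of $\nu$ and $\Sigma = (E R^2/p)\Lambda$ are $\bar x = (1/N)X'\varepsilon_N$ and $S = (1/n)A$, where $A = (X - \varepsilon_N\bar x')'(X - \varepsilon_N\bar x')$.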

Since

$$ (X - \varepsilon_N\nu')'(X - \varepsilon_N\nu') = A + N(\bar x - \nu)(\bar x - \nu)', \tag{17} $$

$A$ and $\bar x$ are a complete set of sufficient statistics.

The maximum likelihood estimators of $\nu$ and $\Lambda$ are $\hat\nu = \bar x$ and $\hat\Lambda = (p/w_g)A$. The maximum likelihood estimator of $\rho_{ij} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is $\hat\rho_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}} = s_{ij}/\sqrt{s_{ii}s_{jj}}$ (Theorem 3.6.4).

The sample correlation $r_{ij}$ is a function $f(X)$ that satisfies the conditions (45) and (46) of Theorem 3.6.5 and hence has the same distribution for an arbitrary density $g[\operatorname{tr}(\cdot)]$ as for the normal density $g[\operatorname{tr}(\cdot)] = \text{const}\; e^{-\frac12\operatorname{tr}(\cdot)}$. Similarly, a partial correlation $r_{ij\cdot q+1,\dots,p}$ and a multiple correlation $R^2$ satisfy the conditions, and the conclusion holds.

Theorem 4.5.3. When $X$ has the vector elliptical density (16), the distributions of $r_{ij}$, $r_{ij\cdot q+1,\dots,p}$, and $R^2$ are the distributions derived for normally distributed observations.

It follows from Theorem 4.5.3 that the asymptotic distributions of $r_{ij}$, $r_{ij\cdot q+1,\dots,p}$, and $R^2$ are the same as for sampling from normal distributions.

The class of left spherical matrices $Y$ with densities is the class of $g(Y'Y)$. Let $X = YC' + \varepsilon_N\nu'$, where $C'\Lambda^{-1}C = I$, that is, $\Lambda = CC'$. Then $X$ has the density

$$ |C|^{-N}\,g[C^{-1}(X - \varepsilon_N\nu')'(X - \varepsilon_N\nu')(C')^{-1}]. \tag{18} $$

We now find a stochastic representation of the matrix $Y$.


Lemma 4.5.1. Let $V = (v_1,\dots,v_p)$, where $v_i$ is an $N$-component vector, $i = 1,\dots,p$. Define recursively $w_1 = v_1$,

$$ w_i = v_i - \sum_{j=1}^{i-1}\frac{v_i'w_j}{w_j'w_j}\,w_j, \qquad i = 2,\dots,p. \tag{19} $$

Let $u_i = w_i/\|w_i\|$. Then $\|u_i\| = 1$, $i = 1,\dots,p$, and $u_i'u_j = 0$, $i \ne j$. Further,

$$ V = UT', \tag{20} $$

where $U = (u_1,\dots,u_p)$; $t_{ii} = \|w_i\|$, $i = 1,\dots,p$; $t_{ij} = v_i'w_j/\|w_j\| = v_i'u_j$, $j = 1,\dots,i-1$, $i = 2,\dots,p$; and $t_{ij} = 0$, $i < j$.

The proof of the lemma is given in the first part of Section 7.2 and as the Gram–Schmidt orthogonalization in the Appendix (Section A.5.1). This lemma generalizes the construction in Section 3.2; see Figure 3.1. See also Figure 7.1.

Note that $T$ is lower triangular, $U'U = I_p$, and $V'V = TT'$. The last equation, $t_{ii} > 0$, $i = 1,\dots,p$, and $t_{ij} = 0$, $i < j$, can be solved uniquely for $T$. Thus $T$ is a function of $V'V$ (and the restrictions).
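The construction in the lemma is the Gram–Schmidt process, and a direct transcription makes the factorization $V = UT'$ concrete. The sketch below stores each column as a Python list, uses the equivalent update $w_i = v_i - \sum_{j<i}(v_i'u_j)u_j$, and assumes the columns are linearly independent; the function name and the three test vectors are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(cols):
    """Return (U, T) with V = U T' for linearly independent columns v_1, ..., v_p.

    U is a list of orthonormal columns u_i; T is lower triangular with
    t_ii = ||w_i|| > 0 and t_ij = v_i'u_j for j < i.
    """
    p = len(cols)
    u_cols = []
    T = [[0.0] * p for _ in range(p)]
    for i, v in enumerate(cols):
        w = list(v)
        for j in range(i):
            T[i][j] = dot(v, u_cols[j])                          # t_ij = v_i'u_j
            w = [wk - T[i][j] * uk for wk, uk in zip(w, u_cols[j])]
        T[i][i] = math.sqrt(dot(w, w))                           # t_ii = ||w_i||
        u_cols.append([wk / T[i][i] for wk in w])
    return u_cols, T

# Three hypothetical 4-component vectors (N = 4, p = 3).
V = [[1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0]]
U, T = gram_schmidt(V)

# Column i of V rebuilt as sum_j t_ij u_j, which is V = U T'.
rebuilt = [[sum(T[i][j] * U[j][k] for j in range(3)) for k in range(4)] for i in range(3)]
# V'V compared with T T'.
gram = [[dot(V[i], V[j]) for j in range(3)] for i in range(3)]
t_tt = [[sum(T[i][k] * T[j][k] for k in range(3)) for j in range(3)] for i in range(3)]
print(T)
```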

Let $Y$ ($N \times p$) have the density $g(Y'Y)$, and let $O_N$ be an orthogonal $N \times N$ matrix. Then $Y^* = O_NY$ has the density $g(Y^{*\prime}Y^*)$. Hence $Y^* = O_NY \overset{d}{=} Y$. Let $Y^* = U^*T^{*\prime}$, where $t^*_{ii} > 0$, $i = 1,\dots,p$, and $t^*_{ij} = 0$, $i < j$. From $Y^{*\prime}Y^* = Y'Y$ it follows that $T^*T^{*\prime} = TT'$ and hence $T^* = T$, $Y^* = U^*T'$, and $U^* = O_NU \overset{d}{=} U$. Let the space of $U$ ($N \times p$) such that $U'U = I_p$ be denoted $O(N \times p)$.

Definition 4.5.1. If $U$ ($N \times p$) satisfies $U'U = I_p$ and $O_NU \overset{d}{=} U$ for all orthogonal $O_N$, then $U$ is uniformly distributed on $O(N \times p)$.

The space of $U$ satisfying $U'U = I_p$ is known as a Stiefel manifold. The probability measure of Definition 4.5.1 is known as the Haar invariant distribution. The property $O_NU \overset{d}{=} U$ for all orthogonal $O_N$ defines the (normalized) measure uniquely [Halmos (1956)].


Theorem 4.5.4. If $Y$ ($N \times p$) has the density $g(Y'Y)$, then $U$ defined by $Y = UT'$, $U'U = I_p$, $t_{ii} > 0$, $i = 1,\dots,p$, and $t_{ij} = 0$, $i < j$, is uniformly distributed on $O(N \times p)$.

The proof of Corollary 7.2.1 shows that for arbitrary $g(\cdot)$ the density of $T$ is

$$ \prod_{i=1}^p\left\{C(N + 1 - i)\,t_{ii}^{N-i}\right\} g(TT'), \tag{21} $$

where $C(\cdot)$ is defined in (8) of Section 2.7.

The stochastic representation of $Y$ ($N \times p$) with density $g(Y'Y)$ is

$$ Y \overset{d}{=} UT', \tag{22} $$

where $U$ ($N \times p$) is uniformly distributed on $O(N \times p)$ and $T$ is lower triangular with positive diagonal elements and has density (21).


Theorem 4.5.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) such that

$$ f(X + \varepsilon_N\nu') = f(X) \tag{23} $$

for all $\nu$ and

$$ f(XG') = f(X) \tag{24} $$

for all $G$ ($p \times p$). Then the distribution of $f(X)$ where $X$ has an arbitrary density (18) is the same as the distribution of $f(X)$ where $X$ has the normal density (18).

Proof. From (23) we find that $f(X) = f(YC')$, and from (24) we find $f(YC') = f(UT'C') = f(U)$, which is the same for arbitrary and normal densities (18). ∎


Corollary 4.5.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) with the density (18), where $\nu = 0$. Suppose (24) holds for all $G$ ($p \times p$). Then the distribution of $f(X)$ for an arbitrary density (18) is the same as the distribution of $f(X)$ when $X$ has the normal density (18).

The condition (24) of Corollary 4.5.5 is that $f(X)$ is invariant with respect to linear transformations $X \to XG'$.

The density (18) can be written as

$$ |C|^{-N}\,g\left\{C^{-1}\left[A + N(\bar x - \nu)(\bar x - \nu)'\right](C')^{-1}\right\}, \tag{25} $$

which shows that $A$ and $\bar x$ are a complete set of sufficient statistics for $\Lambda = CC'$ and $\nu$.


PROBLEMS 


4.1. (Sec. 4.2.1) Sketch

$$ k_N(r) = \frac{\Gamma[\frac12(N-1)]}{\Gamma[\frac12(N-2)]\sqrt\pi}\,(1 - r^2)^{\frac12(N-4)} $$

for (a) $N = 3$, (b) $N = 4$, (c) $N = 5$, and (d) $N = 10$.

4.2. (Sec. 4.2.1) Using the data of Problem 3.1, test the hypothesis that $X_1$ and $X_2$ are independent against all alternatives of dependence at significance level 0.01.

4.3. (Sec. 4.2.1) Suppose a sample correlation of 0.65 is observed in a sample of 10. Test the hypothesis of independence against the alternatives of positive correlation at significance level 0.05.

4.4. (Sec. 4.2.2) Suppose a sample correlation of 0.65 is observed in a sample of 20. Test the hypothesis that the population correlation is 0.4 against the alternatives that the population correlation is greater than 0.4 at significance level 0.05.

4.5. (Sec. 4.2.1) Find the significance points for testing $\rho = 0$ at the 0.01 level with $N = 15$ observations against alternatives (a) $\rho \ne 0$, (b) $\rho > 0$, and (c) $\rho < 0$.

4.6. (Sec. 4.2.2) Find significance points for testing $\rho = 0.6$ at the 0.01 level with $N = 20$ observations against alternatives (a) $\rho \ne 0.6$, (b) $\rho > 0.6$, and (c) $\rho < 0.6$.

4.7. (Sec. 4.2.2) Tabulate the power function at $\rho = -1(0.2)1$ for the tests in Problem 4.5. Sketch the graph of each power function.

4.8. (Sec. 4.2.2) Tabulate the power function at $\rho = -1(0.2)1$ for the tests in Problem 4.6. Sketch the graph of each power function.

4.9. (Sec. 4.2.2) Using the data of Problem 3.1, find a (two-sided) confidence interval for $\rho_{12}$ with confidence coefficient 0.99.

4.10. (Sec. 4.2.2) Suppose $N = 10$, $r = 0.795$. Find a one-sided confidence interval for $\rho$ [of the form $(r_0, 1)$] with confidence coefficient 0.95.
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4.11. | 
4.12. 
4.13. 
4.14. 


4.15. 


4.16. 


4.17. 


4.18. 


4.19. 


4.20. 
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(Sec. 4.2.3) Use Fisher’s z to test the hypothesis р = 0.7 against alternatives 
р= 0.7 at the 0.05 level with г = 0.5 апа N = 50. 


(Sec. 4.2.3) Use Fisher's z to test the hypothesis p; = p, against the alterna- 
tives p, * p, at the 0.01 level with г, = 0.5, №, = 40, ғ, = 0.6, №, = 40. 


(Sec. 4.2.3) Use Fisher's 2 to estimate р based on sample correlations of — 0.7 
(N = 30) and of —0.6 (N = 40). 


(Sec. 4.2.3) Use Fisher's z to obtain a confidence interval for р with сопй- 
dence 0.95 based on a sample correlation of 0.65 and a sample size of 25. 


(Sec. 42.2). Prove that when N = 2 and р=0, Pr(r- 1) = Pr{r = – 1) = 1. 


(Sec. 42) Let kp(r, p) be the density of thc sample corrclation coefficient г 
for a given value of p and N. Prove that r has a monotone likelihood ratio; that 
is, show that if p, > р, then ky(r, p,)/ky(r, p2) is monotonically increasing іп 
r. [ Hint: Using (40), prove that if 


o 


F[3.3;n+453(1 + pr)] = È, call +per)“ = 8(г,р) 


а=0 


has a monotone ratio, then ky(r, р) does. Show 


o? Y? вос. Св[(а- BY ro (a+ В) (1 ro) ^? 
353; log &(r, p) = eee? 
P 212° oca (1 +)“ l 


if (2?/8pór)log g(r, p) > 0, then g(r, р) has a monotone ratio. Show the 
numerator of the above expression is positive by showing that for each « the 
sum on В is positive; use the fact that c,,, < 3c, ] 


(Sec. 4.2) Show that of all tests of ру against a specific p, (> ро) based on г, 
the procedures for which r > c implies rejection are the best. [ Hint: This follows 
from Problem 4.16.] 


(Sec. 4.2) Show that of all tests of p= ру against р> pọ based on г, a 
procedure for which r >c implies rejection is uniformly most powerful. 


(Sec. 4.2) Prove г has a monotone likelihood ratio for r > 0, р> 0 by proving 
h(r) =ky(r, ру) /kyGr, p2) is monotonically increasing for ру > рз. Here h(r) is 
a constant times (15 oc, p?r*)/(7,.oc, esr"). In the numerator of A'(r), 
show that the coefficient of r° is positive. 


(Sec. 4.2) Prove that if $\Sigma$ is diagonal, then the sets $r_{ij}$ and $a_{ii}$ are independently
distributed. [Hint: Use the facts that $r_{ij}$ is invariant under scale transformations
and that the density of the observations depends only on the $a_{ii}$.]
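4.21. (Sec. 4.2.1) Prove that if $\rho = 0$,

    $\mathcal{E}\, r^{2m} = \dfrac{\Gamma[\frac{1}{2}(N - 1)]\,\Gamma(m + \frac{1}{2})}{\sqrt{\pi}\,\Gamma[\frac{1}{2}(N - 1) + m]}$.


4.22. (Sec. 4.2.2) Prove $f_1(\rho)$ and $f_2(\rho)$ are monotonically increasing functions
of $\rho$.


4.23. (Sec. 4.2.2) Prove that the density of the sample correlation $r$ [given by
(38)] is

    $\dfrac{n - 1}{\pi} (1 - \rho^2)^{\frac{1}{2}n} (1 - r^2)^{\frac{1}{2}(n-3)} \displaystyle\int_0^1 \dfrac{x^{n-1}}{(1 - \rho r x)^n \sqrt{1 - x^2}}\,dx$.

[Hint: Expand $(1 - \rho r x)^{-n}$ in a power series, integrate, and use the duplication
formula for the gamma function.]


4.24. (Sec. 4.2) Prove that (39) is the density of $r$. [Hint: From Problem 2.12 show

    $\displaystyle\int_0^\infty\!\!\int_0^\infty e^{-\frac{1}{2}(y^2 - 2xyz + z^2)}\,dy\,dz = \dfrac{\cos^{-1}(-x)}{\sqrt{1 - x^2}}$.

Then argue

    $\displaystyle\int_0^\infty\!\!\int_0^\infty (yz)^{n-1} e^{-\frac{1}{2}(y^2 - 2xyz + z^2)}\,dy\,dz = \dfrac{d^{n-1}}{dx^{n-1}} \dfrac{\cos^{-1}(-x)}{\sqrt{1 - x^2}}$.

Finally show that the integral of (31) with respect to $a_{11}$ ($= y^2$) and $a_{22}$ ($= z^2$) is
(39).]


4.25. (Sec. 4.2) Prove that (40) is the density of $r$. [Hint: In (31) let $a_{11} = ue^{-v}$ and
$a_{22} = ue^{v}$; show that the density of $v$ ($0 \le v < \infty$) and $r$ ($-1 \le r \le 1$) is

    $\dfrac{n - 1}{\pi} (1 - \rho^2)^{\frac{1}{2}n} (1 - r^2)^{\frac{1}{2}(n-3)} [\cosh v - \rho r]^{-n}$.

Use the expansion

    $(1 - y)^{-\frac{1}{2}} = \sum_{j=0}^{\infty} \dfrac{\Gamma(j + \frac{1}{2})}{\Gamma(\frac{1}{2})\, j!}\, y^j$.

Show that the integral is (40).]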
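4.26. (Sec. 4.2) Prove for integer $h$

    $\mathcal{E}\, r^{2h+1} = \dfrac{(1 - \rho^2)^{\frac{1}{2}n}}{\sqrt{\pi}\,\Gamma(\frac{1}{2}n)} \sum_{\beta=0}^{\infty} \dfrac{(2\rho)^{2\beta+1}}{(2\beta + 1)!}\, \dfrac{\Gamma^2[\frac{1}{2}(n + 1) + \beta]\,\Gamma(h + \beta + \frac{3}{2})}{\Gamma(\frac{1}{2}n + h + \beta + 1)}$,

    $\mathcal{E}\, r^{2h} = \dfrac{(1 - \rho^2)^{\frac{1}{2}n}}{\sqrt{\pi}\,\Gamma(\frac{1}{2}n)} \sum_{\beta=0}^{\infty} \dfrac{(2\rho)^{2\beta}}{(2\beta)!}\, \dfrac{\Gamma^2(\frac{1}{2}n + \beta)\,\Gamma(h + \beta + \frac{1}{2})}{\Gamma(\frac{1}{2}n + h + \beta)}$.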


4.27. (Sec. 4.2) The t-distribution. Prove that if $X$ and $Y$ are independently
distributed, $X$ having the distribution $N(0, 1)$ and $Y$ having the $\chi^2$-distribution
with $m$ degrees of freedom, then $W = X/\sqrt{Y/m}$ has the density

    $\dfrac{\Gamma[\frac{1}{2}(m + 1)]}{\sqrt{m\pi}\,\Gamma(\frac{1}{2}m)} \left(1 + \dfrac{t^2}{m}\right)^{-\frac{1}{2}(m+1)}$.

[Hint: In the joint density of $X$ and $Y$, let $x = t w^{\frac{1}{2}} m^{-\frac{1}{2}}$ and integrate out $w$.]


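4.28. (Sec. 4.2) Prove

    $\mathcal{E}\, r = \dfrac{(1 - \rho^2)^{\frac{1}{2}n}}{\Gamma(\frac{1}{2}n)} \sum_{\beta=0}^{\infty} \dfrac{\rho^{2\beta+1}\,\Gamma^2[\frac{1}{2}(n + 1) + \beta]}{\beta!\,\Gamma(\frac{1}{2}n + \beta + 1)}$.

[Hint: Use Problem 4.26 and the duplication formula for the gamma function.]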


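4.29. (Sec. 4.2) Show that $\sqrt{n}(r_{ij} - \rho_{ij})$, $(i, j) = (1, 2), (1, 3), (2, 3)$, have a joint
limiting distribution with variances $(1 - \rho_{ij}^2)^2$ and covariances of $r_{ij}$ and $r_{ik}$, $j \ne k$,
being

    $\frac{1}{2}(2\rho_{jk} - \rho_{ij}\rho_{ik})(1 - \rho_{ij}^2 - \rho_{ik}^2 - \rho_{jk}^2) + \rho_{jk}^3$.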


4.30. (Sec. 4.3.2) Find a confidence interval for $\rho_{13\cdot2}$ with confidence 0.95 based on
$r_{13\cdot2} = 0.097$ and $N = 20$.


4.31. (Sec. 4.3.2) Use Fisher's z to test the hypothesis $\rho_{12\cdot34} = 0$ against alternatives
$\rho_{12\cdot34} \ne 0$ at significance level 0.01 with $r_{12\cdot34} = 0.14$ and $N = 40$.


4.32. (Sec. 4.3) Show that the inequality $r_{12\cdot3}^2 \le 1$ is the same as the inequality
$|r_{ij}| \ge 0$, where $|r_{ij}|$ denotes the determinant of the $3 \times 3$ correlation matrix.


4.33. (Sec. 4.3) Invariance of the sample partial correlation coefficient. Prove that
$r_{12\cdot3,\dots,p}$ is invariant under the transformations $x_{i\alpha}^* = a_i x_{i\alpha} + b_i' x_\alpha^{(3)} + c_i$, $a_i > 0$,


4.34. (Sec. 4.4) Invariance of the sample multiple correlation coefficient. Prove that $R$
is a function of the sufficient statistics $\bar{x}$ and $S$ that is invariant under changes
of location and scale of $x_{1\alpha}$ and nonsingular linear transformations of $x_\alpha^{(2)}$ (that
is, $x_{1\alpha}^* = c x_{1\alpha} + d$, $x_\alpha^{(2)*} = C x_\alpha^{(2)} + d$, $\alpha = 1, \dots, N$) and that every function of $\bar{x}$
and $S$ that is invariant is a function of $R$.
4.35. (Sec. 4.4) Prove that conditional on $Z_{1\alpha} = z_{1\alpha}$, $\alpha = 1, \dots, n$, $R^2/(1 - R^2)$ is
distributed like $T^2/(N^* - 1)$, where $T^2 = N^* \bar{x}' S^{-1} \bar{x}$ based on $N^* = n$ observations
on a vector $X$ with $p^* = p - 1$ components, with mean vector $(c/\sigma_{11})\sigma_{(1)}$
($nc^2 = \sum z_{1\alpha}^2$) and covariance matrix $\Sigma_{22\cdot1} = \Sigma_{22} - (1/\sigma_{11})\sigma_{(1)}\sigma_{(1)}'$. [Hint: The
conditional distribution of $Z_\alpha^{(2)}$ given $Z_{1\alpha} = z_{1\alpha}$ is $N[(1/\sigma_{11})\sigma_{(1)} z_{1\alpha}, \Sigma_{22\cdot1}]$.
There is an $n \times n$ orthogonal matrix $B$ which carries $(z_{11}, \dots, z_{1n})$ into $(c, \dots, c)$
and $(Z_{i1}, \dots, Z_{in})$ into $(Y_{i1}, \dots, Y_{in})$, $i = 2, \dots, p$. Let the new $X_\alpha'$ be
$(Y_{2\alpha}, \dots, Y_{p\alpha})$.]


4.36. (Sec. 4.4) Prove that the noncentrality parameter in the distribution in Problem
4.35 is $(a_{11}/\sigma_{11}) \bar{R}^2/(1 - \bar{R}^2)$.


4.37. (Sec. 4.4) Find the distribution of $R^2/(1 - R^2)$ by multiplying the density of
Problem 4.35 by the density of $a_{11}$ and integrating with respect to $a_{11}$.


4.38. (Sec. 4.4) Show that the density of $r^2$ derived from (38) of Section 4.2 is
identical with (42) in Section 4.4 for $p = 2$. [Hint: Use the duplication formula
for the gamma function.]


4.39. (Sec. 4.4) Prove that (30) is the uniformly most powerful test of $\bar{R} = 0$ based
on $r$. [Hint: Use the Neyman-Pearson fundamental lemma.]


4.40. (Sec. 4.4) Prove that (47) is the unique unbiased estimator of $\bar{R}^2$ based on $R^2$.


4.41. The estimates of $\mu$ and $\Sigma$ in Problem 3.1 are

    $\bar{x} = (185.72 \quad 151.12 \quad 183.84 \quad 149.24)'$,

        95.2953   52.8683   69.6617   46.1117
    S = 52.8683   54.3600   51.3117   35.0533
        69.6617   51.3117  100.8067   56.5400
        46.1117   35.0533   56.5400   45.0233

(a) Find the estimates of the parameters of the conditional distribution of
$(x_3, x_4)$ given $(x_1, x_2)$; that is, find $S_{21} S_{11}^{-1}$ and $S_{22\cdot1} = S_{22} - S_{21} S_{11}^{-1} S_{12}$.

(b) Find the partial correlation $r_{34\cdot12}$.

(c) Use Fisher's z to find a confidence interval for $\rho_{34\cdot12}$ with confidence 0.95.

(d) Find the sample multiple correlation coefficients between $x_3$ and $(x_1, x_2)$
and between $x_4$ and $(x_1, x_2)$.

(e) Test the hypotheses that $x_3$ is independent of $(x_1, x_2)$ and $x_4$ is independent
of $(x_1, x_2)$ at significance levels 0.05.


4.42. Let the components of $X$ correspond to scores on tests in arithmetic speed
($X_1$), arithmetic power ($X_2$), memory for words ($X_3$), memory for meaningful
symbols ($X_4$), and memory for meaningless symbols ($X_5$). The observed correlations
in a sample of 140 are [Kelley (1928)]


1.0000 0.4248 0.0420 0.0215 0.0573 
0.4248 1.0000 0.1487 0.2489 0.2843 
0.0420 0.1487 1.0000 0.6693 0.4662 
0.0215 0.2489 0.6693 1.0000 0.6915 
0.0573 0.2843 0.4662 0.6915 1.0000 


(a) Find the partial correlation between $X_4$ and $X_5$, holding $X_3$ fixed.


(b) Find the partial correlation between $X_1$ and $X_2$, holding $X_3$, $X_4$, and $X_5$
fixed.


(c) Find the multiple correlation between $X_1$ and the set $X_3$, $X_4$, and $X_5$.


(d) Test the hypothesis at the 1% significance level that arithmetic speed is 
independent of the three memory scores. 


4.43. (Sec. 4.3) Prove that if $\rho_{ij\cdot q+1,\dots,p} = 0$, then

    $\sqrt{N - 2 - (p - q)}\; r_{ij\cdot q+1,\dots,p} \Big/ \sqrt{1 - r_{ij\cdot q+1,\dots,p}^2}$

is distributed according to the t-distribution with $N - 2 - (p - q)$
degrees of freedom.


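4.44. (Sec. 4.3) Let $X' = (X_1, X_2, X^{(2)\prime})$ have the distribution $N(\mu, \Sigma)$. The
conditional distribution of $X_1$ given $X_2 = x_2$ and $X^{(2)} = x^{(2)}$ is


    $N[\mu_1 + \gamma_2(x_2 - \mu_2) + \gamma'(x^{(2)} - \mu^{(2)}),\ \sigma_{11\cdot2,\dots,p}]$,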


where 


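The estimators of $\gamma_2$ and $\gamma$ are defined by


    $\begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix} \begin{pmatrix} c_2 \\ c \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{(1)} \end{pmatrix}$.

Show $c_2 = a_{12\cdot3,\dots,p}/a_{22\cdot3,\dots,p}$. [Hint: Solve for $c$ in terms of $c_2$ and the $a$'s, and
substitute.]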


4.45. (Sec. 4.3) In the notation of Problem 4.44, prove
Hint: Use


    $a_{11\cdot2,\dots,p} = a_{11} - (c_2 \ \ c') \begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix} \begin{pmatrix} c_2 \\ c \end{pmatrix}$.


4.46. (Sec. 4.3) Prove that $1/a_{22\cdot3,\dots,p}$ is the element in the upper left-hand corner
of

    $\begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix}^{-1}$.


4.47. (Sec. 4.3) Using the results in Problems 4.43-4.46, prove that the test for
$\rho_{12\cdot3,\dots,p} = 0$ is equivalent to the usual t-test for $\gamma_2 = 0$.


4.48. Missing observations. Let $X = (Y' \ Z')'$, where $Y$ has $p$ components and $Z$ has $q$
components, be distributed according to $N(\mu, \Sigma)$, where


    $\mu = \begin{pmatrix} \mu_y \\ \mu_z \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}$.

Let $M$ observations be made on $X$, and $N - M$ additional observations be made
on $Y$. Find the maximum likelihood estimates of $\mu$ and $\Sigma$. [Anderson (1957).]


[Hint: Express the likelihood function in terms of the marginal density of $Y$ and
the conditional density of $Z$ given $Y$.]


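4.49. Suppose $X$ is distributed according to $N(0, \Sigma)$, where

    $\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}$.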

Show that on the basis of one observation, $x' = (x_1, x_2, x_3)$, we can obtain a
confidence interval for $\rho$ (with confidence coefficient $1 - \alpha$) by using as
endpoints of the interval the solutions in $t$ of


    $[\chi_3^2(\alpha) + x_2^2]\, t^2 - 2t(x_1 x_2 + x_2 x_3) + x_1^2 + x_2^2 + x_3^2 - \chi_3^2(\alpha) = 0$,


where $\chi_3^2(\alpha)$ is the significance point of the $\chi^2$-distribution with three degrees
of freedom at significance level $\alpha$.


CHAPTER 5 


The Generalized $T^2$-Statistic


5.1. INTRODUCTION


One of the most important groups of problems in univariate statistics relates
to the mean of a given distribution when the variance of the distribution is
unknown. On the basis of a sample one may wish to decide whether the
mean is equal to a number specified in advance, or one may wish to give an
interval within which the mean lies. The statistic usually used in univariate
statistics is the difference between the mean of the sample $\bar{x}$ and the
hypothetical population mean $\mu$ divided by the sample standard deviation $s$.
If the distribution sampled is $N(\mu, \sigma^2)$, then


(1)    $t = \sqrt{N}\, \dfrac{\bar{x} - \mu}{s}$


has the well-known t-distribution with $N - 1$ degrees of freedom, where $N$ is
the number of observations in the sample. On the basis of this fact, one can
set up a test of the hypothesis $\mu = \mu_0$, where $\mu_0$ is specified, or one can set
up a confidence interval for the unknown parameter $\mu$.

The multivariate analog of the square of $t$ given in (1) is


(2)    $T^2 = N(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu)$,


where $\bar{x}$ is the mean vector of a sample of $N$, and $S$ is the sample covariance
matrix. It will be shown how this statistic can be used for testing hypotheses
about the mean vector $\mu$ of the population and for obtaining confidence
regions for the unknown $\mu$. The distribution of $T^2$ will be obtained when $\mu$
in (2) is the mean of the distribution sampled and when $\mu$ is different from
the population mean. Hotelling (1931) proposed the $T^2$-statistic for two
samples and derived the distribution when $\mu$ is the population mean.


An Introduction to Multivariate Statistical Analysis. Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.

In Section 5.3 various uses of the $T^2$-statistic are presented, including
simultaneous confidence intervals for all linear combinations of the mean
vector. A James-Stein estimator is given when $\Sigma$ is unknown. The power
function of the $T^2$-test is treated in Section 5.4, and the multivariate
Behrens-Fisher problem in Section 5.5. In Section 5.6, optimum properties
of the $T^2$-test are considered, with regard to both invariance and admissibility.
Stein's criterion for admissibility in the general exponential family is
proved and applied. The last section is devoted to inference about the mean
in elliptically contoured distributions.


5.2. DERIVATION OF THE GENERALIZED $T^2$-STATISTIC
AND ITS DISTRIBUTION


5.2.1. Derivation of the $T^2$-Statistic As a Function of the Likelihood
Ratio Criterion


Although the $T^2$-statistic has many uses, we shall begin our discussion by
showing that the likelihood ratio test of the hypothesis $H\colon \mu = \mu_0$ on the
basis of a sample from $N(\mu, \Sigma)$ is based on the $T^2$-statistic given in (2) of
Section 5.1. Suppose we have $N$ observations $x_1, \dots, x_N$ ($N > p$). The likelihood
function is


(1)    $L(\mu, \Sigma) = (2\pi)^{-\frac{1}{2}pN} |\Sigma|^{-\frac{1}{2}N} \exp\left[-\frac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu)\right]$.


The observations are given; $L$ is a function of the indeterminates $\mu$, $\Sigma$. (We
shall not distinguish in notation between the indeterminates and the parameters.)
The likelihood ratio criterion is


(2)    $\lambda = \dfrac{\max_{\Sigma} L(\mu_0, \Sigma)}{\max_{\mu, \Sigma} L(\mu, \Sigma)}$,


that is, the numerator is the maximum of the likelihood function for $\mu$, $\Sigma$ in
the parameter space restricted by the null hypothesis ($\mu = \mu_0$, $\Sigma$ positive
definite), and the denominator is the maximum over the entire parameter
space ($\Sigma$ positive definite). When the parameters are unrestricted, the maximum
occurs when $\mu$, $\Sigma$ are defined by the maximum likelihood estimators
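(Section 3.2) of $\mu$ and $\Sigma$,


(3)    $\hat{\mu}_\Omega = \bar{x}$,


(4)    $\hat{\Sigma}_\Omega = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.


When $\mu = \mu_0$, the likelihood function is maximized at


(5)    $\hat{\Sigma}_\omega = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \mu_0)(x_\alpha - \mu_0)'$


by Lemma 3.2.2. Furthermore, by Lemma 3.2.2


(6)    $\max_{\mu, \Sigma} L(\mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN} |\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}\, e^{-\frac{1}{2}pN}$,


(7)    $\max_{\Sigma} L(\mu_0, \Sigma) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN} |\hat{\Sigma}_\omega|^{\frac{1}{2}N}}\, e^{-\frac{1}{2}pN}$.


Thus the likelihood ratio criterion is


(8)    $\lambda = \dfrac{|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}{|\hat{\Sigma}_\omega|^{\frac{1}{2}N}} = \dfrac{|\sum (x_\alpha - \bar{x})(x_\alpha - \bar{x})'|^{\frac{1}{2}N}}{|\sum (x_\alpha - \mu_0)(x_\alpha - \mu_0)'|^{\frac{1}{2}N}} = \dfrac{|A|^{\frac{1}{2}N}}{|A + N(\bar{x} - \mu_0)(\bar{x} - \mu_0)'|^{\frac{1}{2}N}}$,


where


(9)    $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (N - 1)S$.


Application of Corollary A.3.1 of the Appendix shows


(10)    $\lambda^{2/N} = \dfrac{|A|}{|A + [\sqrt{N}(\bar{x} - \mu_0)][\sqrt{N}(\bar{x} - \mu_0)]'|} = \dfrac{1}{1 + N(\bar{x} - \mu_0)' A^{-1} (\bar{x} - \mu_0)} = \dfrac{1}{1 + T^2/(N - 1)}$,


where


(11)    $T^2 = N(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) = (N - 1) N (\bar{x} - \mu_0)' A^{-1} (\bar{x} - \mu_0)$.
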
The likelihood ratio test is defined by the critical region (region of
rejection)


(12)    $\lambda \le \lambda_0$,


where $\lambda_0$ is chosen so that the probability of (12) when the null hypothesis is
true is equal to the significance level. If we take the $\frac{1}{2}N$th root of both sides
of (12) and invert, subtract 1, and multiply by $N - 1$, we obtain


(13)    $T^2 \ge T_0^2$,

where

(14)    $T_0^2 = (N - 1)(\lambda_0^{-2/N} - 1)$.


Theorem 5.2.1. The likelihood ratio test of the hypothesis $\mu = \mu_0$ for the
distribution $N(\mu, \Sigma)$ is given by (13), where $T^2$ is defined by (11), $\bar{x}$ is the mean
of a sample of $N$ from $N(\mu, \Sigma)$, $S$ is the covariance matrix of the sample, and $T_0^2$
is chosen so that the probability of (13) under the null hypothesis is equal to the
chosen significance level.
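
The relation between the likelihood ratio criterion and $T^2$ can be checked numerically. The following is a minimal sketch in plain Python for $p = 2$; the sample values, `mu0`, and all function names are illustrative assumptions and do not come from the text. It computes $T^2$ from (11) and $\lambda^{2/N}$ from the determinant ratio in (8), and compares the latter with $1/[1 + T^2/(N - 1)]$ as in (10).

```python
# Sketch: check identity (10), lambda^(2/N) = 1 / (1 + T^2/(N - 1)), for p = 2.
# The sample and mu0 are made-up illustrative values, not data from the text.

def mean_vec(xs):
    n = len(xs)
    return [sum(x[i] for x in xs) / n for i in range(2)]

def scatter(xs, center):
    """Sum over the sample of (x - center)(x - center)' as a 2x2 nested list."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - center[0], x[1] - center[1]]
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def quad_inv2(m, v):
    """v' m^{-1} v for a nonsingular symmetric 2x2 matrix m."""
    return (m[1][1] * v[0] ** 2 - 2.0 * m[0][1] * v[0] * v[1]
            + m[0][0] * v[1] ** 2) / det2(m)

def hotelling_t2(xs, mu0):
    """T^2 = (N - 1) N (xbar - mu0)' A^{-1} (xbar - mu0), as in (11)."""
    n_obs = len(xs)
    xbar = mean_vec(xs)
    a = scatter(xs, xbar)                      # A = (N - 1) S, as in (9)
    diff = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
    return (n_obs - 1) * n_obs * quad_inv2(a, diff)

def lambda_2_over_n(xs, mu0):
    """lambda^(2/N) as the ratio of determinants in (8)."""
    return det2(scatter(xs, mean_vec(xs))) / det2(scatter(xs, mu0))

sample = [(4.0, 2.0), (5.0, 3.5), (6.0, 3.0), (5.5, 4.0), (4.5, 2.5)]
mu0 = (4.0, 3.0)
N = len(sample)
t2 = hotelling_t2(sample, mu0)
lam = lambda_2_over_n(sample, mu0)
print(t2, lam, 1.0 / (1.0 + t2 / (N - 1)))    # the last two values should agree
```

Because $\lambda^{2/N}$ is a decreasing function of $T^2$, rejecting for small $\lambda$ is the same as rejecting for large $T^2$, which is the content of (12) and (13).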


The Student t-test has the property that when testing $\mu = 0$ it is invariant
with respect to scale transformations. If the scalar random variable $X$ is
distributed according to $N(\mu, \sigma^2)$, then $X^* = cX$ is distributed according to
$N(c\mu, c^2\sigma^2)$, which is in the same class of distributions, and the hypothesis
$\mathcal{E}X = 0$ is equivalent to $\mathcal{E}X^* = \mathcal{E}cX = 0$. If the observations $x_\alpha$ are transformed
similarly ($x_\alpha^* = c x_\alpha$), then, for $c > 0$, $t^*$ computed from $x_\alpha^*$ is the
same as $t$ computed from $x_\alpha$. Thus, whatever the unit of measurement the
statistical result is the same.

The generalized $T^2$-test has a similar property. If the vector random
variable $X$ is distributed according to $N(\mu, \Sigma)$, then $X^* = CX$ (for $|C| \ne 0$) is
distributed according to $N(C\mu, C\Sigma C')$, which is in the same class of distributions.
The hypothesis $\mathcal{E}X = 0$ is equivalent to the hypothesis $\mathcal{E}X^* = \mathcal{E}CX = 0$.
If the observations $x_\alpha$ are transformed in the same way, $x_\alpha^* = C x_\alpha$, then $T^{*2}$
computed on the basis of $x_\alpha^*$ is the same as $T^2$ computed on the basis of $x_\alpha$.
This follows from the facts that $\bar{x}^* = C\bar{x}$ and $A^* = CAC'$ and the following
lemma:


Lemma 5.2.1. For any $p \times p$ nonsingular matrices $C$ and $H$ and any
vector $k$,


(15)    $k' H^{-1} k = (Ck)' (CHC')^{-1} (Ck)$.
Proof. The right-hand side of (15) is

(16)    $(Ck)' (CHC')^{-1} (Ck) = k' C' (C')^{-1} H^{-1} C^{-1} C k = k' H^{-1} k$.  ∎

We shall show in Section 5.6 that of all tests invariant with respect to such
transformations, (13) is the uniformly most powerful.
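
The invariance just described can be illustrated numerically. The sketch below, in plain Python for $p = 2$, computes $T^2$ for the hypothesis of zero mean from a small made-up sample, applies a nonsingular transformation $x_\alpha^* = C x_\alpha$, and recomputes the statistic. The sample, the matrix `C`, and the function names are assumptions chosen only for illustration.

```python
# Sketch: T^2 for testing mean = 0 is unchanged by x* = C x with C nonsingular.
# All numbers below are illustrative, not taken from the text.

def t2_zero_mean(xs):
    """T^2 = N xbar' S^{-1} xbar for bivariate observations."""
    n_obs = len(xs)
    xb = [sum(x[0] for x in xs) / n_obs, sum(x[1] for x in xs) / n_obs]
    s = [[sum((x[i] - xb[i]) * (x[j] - xb[j]) for x in xs) / (n_obs - 1)
          for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    quad = (s[1][1] * xb[0] ** 2 - 2.0 * s[0][1] * xb[0] * xb[1]
            + s[0][0] * xb[1] ** 2) / det
    return n_obs * quad

def transform(xs, c):
    """Apply the 2x2 matrix c to every observation."""
    return [(c[0][0] * x + c[0][1] * y, c[1][0] * x + c[1][1] * y)
            for x, y in xs]

sample = [(1.0, 0.5), (2.0, 1.5), (0.5, 2.0), (1.5, 1.0), (2.5, 3.0)]
C = [[2.0, 1.0], [-1.0, 3.0]]        # determinant 7, so C is nonsingular

t2_original = t2_zero_mean(sample)
t2_transformed = t2_zero_mean(transform(sample, C))
print(t2_original, t2_transformed)   # equal up to rounding error
```

The agreement is exactly what Lemma 5.2.1 gives with $k = \sqrt{N}\,\bar{x}$ and $H = S$.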


We can give a geometric interpretation of the $\frac{1}{2}N$th root of the likelihood
ratio criterion,


(17)    $\lambda^{2/N} = \dfrac{|\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'|}{|\sum_{\alpha=1}^{N} (x_\alpha - \mu_0)(x_\alpha - \mu_0)'|}$,


in terms of parallelotopes. (See Section 7.5.) In the $p$-dimensional representation
the numerator of $\lambda^{2/N}$ is the sum of squares of volumes of all
parallelotopes with principal edges $p$ vectors, each with one endpoint at $\bar{x}$
and the other at an $x_\alpha$. The denominator is the sum of squares of volumes of
all parallelotopes with principal edges $p$ vectors, each with one endpoint at
$\mu_0$ and the other at $x_\alpha$. If the sum of squared volumes involving vectors
emanating from $\bar{x}$, the "center" of the $x_\alpha$, is much less than that involving
vectors emanating from $\mu_0$, then we reject the hypothesis that $\mu_0$ is the
mean of the distribution.

There is also an interpretation in the $N$-dimensional representation. Let
$y_i = (x_{i1}, \dots, x_{iN})'$ be the $i$th vector. Then


(18)    $\sqrt{N}\, \bar{x}_i = \sum_{\alpha=1}^{N} \dfrac{1}{\sqrt{N}}\, x_{i\alpha}$


is the distance from the origin of the projection of $y_i$ on the equiangular line
(with direction cosines $1/\sqrt{N}, \dots, 1/\sqrt{N}$). The coordinates of the projection
are $(\bar{x}_i, \dots, \bar{x}_i)$. Then $(x_{i1} - \bar{x}_i, \dots, x_{iN} - \bar{x}_i)$ is the projection of $y_i$ on the
plane through the origin perpendicular to the equiangular line. The numerator
of $\lambda^{2/N}$ is the square of the $p$-dimensional volume of the parallelotope
with principal edges, the vectors $(x_{i1} - \bar{x}_i, \dots, x_{iN} - \bar{x}_i)$. A point
$(x_{i1} - \mu_{0i}, \dots, x_{iN} - \mu_{0i})$ is obtained from $y_i$ by translation parallel to the equiangular
line (by a distance $\sqrt{N} \mu_{0i}$). The denominator of $\lambda^{2/N}$ is the square of the
volume of the parallelotope with principal edges these vectors. Then $\lambda^{2/N}$ is
the ratio of these squared volumes.


5.2.2. The Distribution of $T^2$


In this subsection we will find the distribution of $T^2$ under general conditions,
including the case when the null hypothesis is not true. Let $T^2 = Y' S^{-1} Y$
where $Y$ is distributed according to $N(\nu, \Sigma)$ and $nS$ is distributed independently
as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ with $Z_1, \dots, Z_n$ independent, each with distribution
$N(0, \Sigma)$. The $T^2$ defined in Section 5.2.1 is a special case of this with
$Y = \sqrt{N}(\bar{x} - \mu_0)$ and $\nu = \sqrt{N}(\mu - \mu_0)$ and $n = N - 1$. Let $D$ be a nonsingular
matrix such that $D \Sigma D' = I$, and define


(19)    $Y^* = DY, \qquad S^* = DSD', \qquad \nu^* = D\nu$.


Then $T^2 = Y^{*\prime} S^{*-1} Y^*$ (by Lemma 5.2.1), where $Y^*$ is distributed according
to $N(\nu^*, I)$ and $nS^*$ is distributed independently as $\sum_{\alpha=1}^{n} Z_\alpha^* Z_\alpha^{*\prime} =
\sum_{\alpha=1}^{n} D Z_\alpha (D Z_\alpha)'$ with the $Z_\alpha^* = D Z_\alpha$ independent, each with distribution
$N(0, I)$. We note $\nu' \Sigma^{-1} \nu = \nu^{*\prime} (I)^{-1} \nu^* = \nu^{*\prime} \nu^*$ by Lemma 5.2.1.

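Let the first row of a $p \times p$ orthogonal matrix $Q$ be defined by


(20)    $q_{1i} = \dfrac{Y_i^*}{\sqrt{Y^{*\prime} Y^*}}, \qquad i = 1, \dots, p$;


this is permissible because $\sum_{i=1}^{p} q_{1i}^2 = 1$. The other $p - 1$ rows can be defined
by some arbitrary rule (Lemma A.4.2 of the Appendix). Since $Q$ depends on
$Y^*$, it is a random matrix. Now let

(21)    $U = Q Y^*, \qquad B = Q\, nS^*\, Q'$.

From the way $Q$ was defined,


(22)    $U_1 = \sum_i q_{1i} Y_i^* = \sqrt{Y^{*\prime} Y^*}$,
        $U_j = \sum_i q_{ji} Y_i^* = \sqrt{Y^{*\prime} Y^*} \sum_i q_{ji} q_{1i} = 0, \qquad j \ne 1$.

Then

(23)    $\dfrac{T^2}{n} = U' B^{-1} U = (U_1, 0, \dots, 0) \begin{pmatrix} b^{11} & b^{12} & \cdots & b^{1p} \\ b^{21} & b^{22} & \cdots & b^{2p} \\ \vdots & \vdots & & \vdots \\ b^{p1} & b^{p2} & \cdots & b^{pp} \end{pmatrix} \begin{pmatrix} U_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = U_1^2\, b^{11}$,


where $(b^{ij}) = B^{-1}$. By Theorem A.3.3 of the Appendix, $1/b^{11} = b_{11} -
b_{(1)}' B_{22}^{-1} b_{(1)} = b_{11\cdot2,\dots,p}$, where


(24)    $B = \begin{pmatrix} b_{11} & b_{(1)}' \\ b_{(1)} & B_{22} \end{pmatrix}$,


and $T^2/n = U_1^2/b_{11\cdot2,\dots,p} = Y^{*\prime} Y^*/b_{11\cdot2,\dots,p}$. The conditional distribution of
$B$ given $Q$ is that of $\sum_{\alpha=1}^{n} V_\alpha V_\alpha'$, where conditionally the $V_\alpha = Q Z_\alpha^*$ are
independent, each with distribution $N(0, I)$. By Theorem 4.3.3 $b_{11\cdot2,\dots,p}$ is
conditionally distributed as $\sum_{\alpha=1}^{n-(p-1)} W_\alpha^2$, where conditionally the $W_\alpha$ are
independent, each with the distribution $N(0, 1)$; that is, $b_{11\cdot2,\dots,p}$ is conditionally
distributed as $\chi^2$ with $n - (p - 1)$ degrees of freedom. Since the
conditional distribution of $b_{11\cdot2,\dots,p}$ does not depend on $Q$, it is unconditionally
distributed as $\chi^2$. The quantity $Y^{*\prime} Y^*$ has a noncentral $\chi^2$-distribution
with $p$ degrees of freedom and noncentrality parameter $\nu^{*\prime} \nu^* = \nu' \Sigma^{-1} \nu$.
Then $T^2/n$ is distributed as the ratio of a noncentral $\chi^2$ and an independent
$\chi^2$.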

Theorem 5.2.2. Let $T^2 = Y' S^{-1} Y$, where $Y$ is distributed according to
$N(\nu, \Sigma)$ and $nS$ is independently distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ with $Z_1, \dots, Z_n$
independent, each with distribution $N(0, \Sigma)$. Then $(T^2/n)[(n - p + 1)/p]$ is
distributed as a noncentral $F$ with $p$ and $n - p + 1$ degrees of freedom and
noncentrality parameter $\nu' \Sigma^{-1} \nu$. If $\nu = 0$, the distribution is central $F$.

We shall call this the $T^2$-distribution with $n$ degrees of freedom.


Corollary 5.2.1. Let $x_1, \dots, x_N$ be a sample from $N(\mu, \Sigma)$, and let $T^2 =
N(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)$. The distribution of $[T^2/(N - 1)][(N - p)/p]$ is noncentral
$F$ with $p$ and $N - p$ degrees of freedom and noncentrality parameter
$N(\mu - \mu_0)' \Sigma^{-1} (\mu - \mu_0)$. If $\mu = \mu_0$, then the $F$-distribution is central.


The above derivation of the $T^2$-distribution is due to Bowker (1960). The
noncentral $F$-density and tables of the distribution are discussed in Section
5.4.

For large samples the distribution of $T^2$ given by Corollary 5.2.1 is
approximately valid even if the parent distribution is not normal; in this sense
the $T^2$-test is a robust procedure.


Theorem 5.2.3. Let $\{X_\alpha\}$, $\alpha = 1, 2, \dots$, be a sequence of independently
identically distributed random vectors with mean vector $\mu$ and covariance matrix
$\Sigma$; let $\bar{X}_N = (1/N) \sum_{\alpha=1}^{N} X_\alpha$, $S_N = [1/(N - 1)] \sum_{\alpha=1}^{N} (X_\alpha - \bar{X}_N)(X_\alpha - \bar{X}_N)'$, and
$T_N^2 = N(\bar{X}_N - \mu_0)' S_N^{-1} (\bar{X}_N - \mu_0)$. Then the limiting distribution of $T_N^2$ as
$N \to \infty$ is the $\chi^2$-distribution with $p$ degrees of freedom if $\mu = \mu_0$.


Proof. By the central limit theorem (Theorem 4.2.3) the limiting distribution
of $\sqrt{N}(\bar{X}_N - \mu)$ is $N(0, \Sigma)$. The sample covariance matrix converges
stochastically to $\Sigma$. Then the limiting distribution of $T_N^2$ is the distribution of
$Y' \Sigma^{-1} Y$, where $Y$ has the distribution $N(0, \Sigma)$. The theorem follows from
Theorem 3.3.3.  ∎
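When the null hypothesis is true, $T^2/n$ is distributed as $\chi_p^2/\chi_{n-p+1}^2$, and
$\lambda^{2/N}$ given by (10) has the distribution of $\chi_{n-p+1}^2/(\chi_{n-p+1}^2 + \chi_p^2)$. The
density of $V = \chi_a^2/(\chi_a^2 + \chi_b^2)$, when $\chi_a^2$ and $\chi_b^2$ are independent, is


(25)    $\dfrac{\Gamma[\frac{1}{2}(a + b)]}{\Gamma(\frac{1}{2}a)\,\Gamma(\frac{1}{2}b)}\, v^{\frac{1}{2}a - 1} (1 - v)^{\frac{1}{2}b - 1} = \beta(v; \tfrac{1}{2}a, \tfrac{1}{2}b)$;


this is the density of the beta distribution with parameters $\frac{1}{2}a$ and $\frac{1}{2}b$
(Problem 5.27). Thus the distribution of $\lambda^{2/N} = (1 + T^2/n)^{-1}$ is the beta
distribution with parameters $\frac{1}{2}(n - p + 1)$ and $\frac{1}{2}p$, in the ordering of (25).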


5.3. USES OF THE $T^2$-STATISTIC


5.3.1. Testing the Hypothesis That the Mean Vector Is a Given Vector

The likelihood ratio test of the hypothesis $\mu = \mu_0$ on the basis of a sample of
$N$ from $N(\mu, \Sigma)$ is equivalent to

(1)    $T^2 \ge T_0^2$


as given in Section 5.2.1. If the significance level is $\alpha$, then the $100\alpha\%$ point
of the $F$-distribution is taken, that is,


(2)    $T_0^2 = \dfrac{(N - 1)p}{N - p}\, F_{p, N-p}(\alpha) = T_{p, N-1}^2(\alpha)$,


say. The choice of significance level may depend on the power of the test. We
shall discuss this in Section 5.4.

The statistic $T^2$ is computed from $\bar{x}$ and $A$. The vector $A^{-1}(\bar{x} - \mu_0) = b$
is the solution of $Ab = \bar{x} - \mu_0$. Then $T^2/(N - 1) = N(\bar{x} - \mu_0)' b$.

Note that $T^2/(N - 1)$ is the nonzero root of


(3)    $|N(\bar{x} - \mu_0)(\bar{x} - \mu_0)' - \lambda A| = 0$.


Lemma 5.3.1. If $v$ is a vector of $p$ components and if $B$ is a nonsingular
$p \times p$ matrix, then $v' B^{-1} v$ is the nonzero root of


(4)    $|vv' - \lambda B| = 0$.


Proof. The nonzero root, say $\lambda_1$, of (4) is associated with a characteristic
vector $\beta$ satisfying


(5)    $vv'\beta = \lambda_1 B \beta$.


Since $\lambda_1 \ne 0$, $v'\beta \ne 0$. Multiplying on the left by $v' B^{-1}$, we obtain

(6)    $(v' B^{-1} v)(v'\beta) = \lambda_1 (v'\beta)$.  ∎


In the case above $v = \sqrt{N}(\bar{x} - \mu_0)$ and $B = A$.


Figure 5.1. A confidence ellipse.
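
Lemma 5.3.1 is easy to verify by hand for $p = 2$, where $|vv' - \lambda B|$ is a quadratic in $\lambda$ with one root at 0. The following plain-Python sketch does this for one vector and one matrix; the particular `v` and `B` are arbitrary values chosen for illustration and are not from the text.

```python
# Sketch: for p = 2, v' B^{-1} v is the nonzero root of |vv' - lambda B| = 0.
# v and B are arbitrary illustrative values.

v = [3.0, -1.0]
B = [[4.0, 1.0], [1.0, 2.0]]          # nonsingular: determinant 7

def det_pencil(lam):
    """Determinant of vv' - lam * B in the 2x2 case."""
    m = [[v[i] * v[j] - lam * B[i][j] for j in range(2)] for i in range(2)]
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

det_b = B[0][0] * B[1][1] - B[0][1] * B[1][0]
# v' B^{-1} v written out with the explicit 2x2 inverse of B
root = (B[1][1] * v[0] ** 2 - (B[0][1] + B[1][0]) * v[0] * v[1]
        + B[0][0] * v[1] ** 2) / det_b

print(root)                 # candidate nonzero root
print(det_pencil(root))     # determinant vanishes at the root
print(det_pencil(0.0))      # the other root is 0, since vv' has rank 1
```

In the application of the text, `v` plays the role of $\sqrt{N}(\bar{x} - \mu_0)$ and `B` the role of $A$, so the root is $T^2/(N - 1)$.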


5.3.2. A Confidence Region for the Mean Vector


If $\mu$ is the mean of $N(\mu, \Sigma)$, the probability is $1 - \alpha$ of drawing a sample of
$N$ with mean $\bar{x}$ and covariance matrix $S$ such that


(7)    $N(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le T_{p, N-1}^2(\alpha)$.


Thus, if we compute (7) for a particular sample, we have confidence $1 - \alpha$
that (7) is a true statement concerning $\mu$. The inequality


(8)    $N(\bar{x} - m)' S^{-1} (\bar{x} - m) \le T_{p, N-1}^2(\alpha)$


is the interior and boundary of an ellipsoid in the $p$-dimensional space of $m$
with center at $\bar{x}$ and with size and shape depending on $S^{-1}$ and $\alpha$. See
Figure 5.1. We state that $\mu$ lies within this ellipsoid with confidence $1 - \alpha$.
Over random samples (8) is a random ellipsoid.
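
Membership of a candidate mean vector $m$ in the region (8) is a single quadratic-form comparison. The sketch below, in plain Python for $p = 2$, shows one way to code it. The values of `xbar`, `S`, and `N` are made up, the function names are hypothetical, and the $F$ point 4.46 (upper 5% point with 2 and 8 degrees of freedom) is assumed from standard $F$ tables rather than taken from this text.

```python
# Sketch: test whether m lies in the confidence ellipse (8) when p = 2.
# xbar, S, N are illustrative; the F quantile is an assumed table value.

def t2_critical(p, n_obs, f_quantile):
    """T^2_{p,N-1}(alpha) = (N - 1) p / (N - p) * F_{p,N-p}(alpha)."""
    return (n_obs - 1) * p / (n_obs - p) * f_quantile

def in_confidence_region(m, xbar, s, n_obs, t2_crit):
    """True if N (xbar - m)' S^{-1} (xbar - m) <= t2_crit."""
    d0, d1 = xbar[0] - m[0], xbar[1] - m[1]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    quad = (s[1][1] * d0 ** 2 - 2.0 * s[0][1] * d0 * d1
            + s[0][0] * d1 ** 2) / det
    return n_obs * quad <= t2_crit

xbar = (1.0, 2.0)
S = [[2.0, 0.5], [0.5, 1.0]]
N = 10
F_2_8_05 = 4.46                       # assumed upper 5% point of F(2, 8)
t2_crit = t2_critical(2, N, F_2_8_05)

center_inside = in_confidence_region(xbar, xbar, S, N, t2_crit)
near_inside = in_confidence_region((1.5, 2.0), xbar, S, N, t2_crit)
far_inside = in_confidence_region((4.0, 2.0), xbar, S, N, t2_crit)
print(t2_crit, center_inside, near_inside, far_inside)
```

The center $\bar{x}$ always belongs to the region, since the quadratic form is zero there.
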
5.3.3. Simultaneous Confidence Intervals for All Linear Combinations 
of the Mean Vector 


From the confidence region (8) for $\mu$ we can obtain confidence intervals for
linear functions $\gamma'\mu$ that hold simultaneously with a given confidence coefficient.


Lemma 5.3.2 (Generalized Cauchy-Schwarz Inequality). For a positive
definite matrix $S$,


(9)    $(\gamma' y)^2 \le \gamma' S \gamma \; y' S^{-1} y$.
Proof. Let $b = \gamma' y/\gamma' S \gamma$. Then


(10)    $0 \le (y - bS\gamma)' S^{-1} (y - bS\gamma)$
        $= y' S^{-1} y - b\gamma' S S^{-1} y - y' S^{-1} S \gamma\, b + b^2 \gamma' S S^{-1} S \gamma$
        $= y' S^{-1} y - \dfrac{(\gamma' y)^2}{\gamma' S \gamma}$,

which yields (9).  ∎


When $y = \bar{x} - \mu$, then (9) implies that


(11)    $|\gamma'(\bar{x} - \mu)| \le \sqrt{\gamma' S \gamma\, (\bar{x} - \mu)' S^{-1} (\bar{x} - \mu)} \le \sqrt{\gamma' S \gamma}\, \sqrt{T_{p, N-1}^2(\alpha)/N}$


holds for all $\gamma$ with probability $1 - \alpha$. Thus we can assert with confidence
$1 - \alpha$ that the unknown parameter vector satisfies simultaneously for all $\gamma$
the inequalities


(12)    $|\gamma'\bar{x} - \gamma' m| \le \sqrt{\gamma' S \gamma}\, \sqrt{T_{p, N-1}^2(\alpha)/N}$.


The confidence region (8) can be explored by setting $\gamma$ in (12) equal to
simple vectors such as $(1, 0, \dots, 0)'$ to obtain $m_1$, $(1, -1, 0, \dots, 0)'$ to yield
$m_1 - m_2$, and so on. It should be noted that if only one linear function $\gamma'\mu$
were of interest, $\sqrt{T_{p,n}^2(\alpha)} = \sqrt{np F_{p, n-p+1}(\alpha)/(n - p + 1)}$ would be
replaced by $t_n(\alpha)$.


5.3.4. Two-Sample Problems 


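Another situation in which the $T^2$-statistic is used is one in which the null
hypothesis is that the mean of one normal population is equal to the mean of
the other where the covariance matrices are assumed equal but unknown.
Suppose $y_1^{(i)}, \dots, y_{N_i}^{(i)}$ is a sample from $N(\mu^{(i)}, \Sigma)$, $i = 1, 2$. We wish to test the
null hypothesis $\mu^{(1)} = \mu^{(2)}$. The vector $\bar{y}^{(i)}$ is distributed according to
$N[\mu^{(i)}, (1/N_i)\Sigma]$. Consequently $\sqrt{N_1 N_2/(N_1 + N_2)}\,(\bar{y}^{(1)} - \bar{y}^{(2)})$ is distributed
according to $N(0, \Sigma)$ under the null hypothesis. If we let


(13)    $S = \dfrac{1}{N_1 + N_2 - 2} \left\{ \sum_{\alpha=1}^{N_1} (y_\alpha^{(1)} - \bar{y}^{(1)})(y_\alpha^{(1)} - \bar{y}^{(1)})' + \sum_{\alpha=1}^{N_2} (y_\alpha^{(2)} - \bar{y}^{(2)})(y_\alpha^{(2)} - \bar{y}^{(2)})' \right\}$,


then $(N_1 + N_2 - 2)S$ is distributed as $\sum_{\alpha=1}^{N_1+N_2-2} Z_\alpha Z_\alpha'$, where $Z_\alpha$ is distributed
according to $N(0, \Sigma)$. Thus


(14)    $T^2 = \dfrac{N_1 N_2}{N_1 + N_2} (\bar{y}^{(1)} - \bar{y}^{(2)})' S^{-1} (\bar{y}^{(1)} - \bar{y}^{(2)})$


is distributed as $T^2$ with $N_1 + N_2 - 2$ degrees of freedom. The critical region
is


(15)    $T^2 > \dfrac{(N_1 + N_2 - 2)p}{N_1 + N_2 - p - 1}\, F_{p, N_1+N_2-p-1}(\alpha)$


with significance level $\alpha$.

A confidence region for $\mu^{(1)} - \mu^{(2)}$ with confidence level $1 - \alpha$ is the set of
vectors $m$ satisfying


(16)    $(\bar{y}^{(1)} - \bar{y}^{(2)} - m)' S^{-1} (\bar{y}^{(1)} - \bar{y}^{(2)} - m) \le \dfrac{N_1 + N_2}{N_1 N_2}\, T_{p, N_1+N_2-2}^2(\alpha)$
        $= \dfrac{N_1 + N_2}{N_1 N_2}\, \dfrac{(N_1 + N_2 - 2)p}{N_1 + N_2 - p - 1}\, F_{p, N_1+N_2-p-1}(\alpha)$.


Simultaneous confidence intervals are


(17)    $|\gamma'(\bar{y}^{(1)} - \bar{y}^{(2)}) - \gamma' m| \le \sqrt{\gamma' S \gamma}\, \sqrt{\dfrac{N_1 + N_2}{N_1 N_2}}\; T_{p, N_1+N_2-2}(\alpha)$.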

An example may be taken from Fisher (1936). Let $x_1$ = sepal length,
$x_2$ = sepal width, $x_3$ = petal length, $x_4$ = petal width. Fifty observations are
taken from the population Iris versicolor (1) and 50 from the population Iris
setosa (2). See Table 3.4. The data may be summarized (in centimeters) as


(18)    $\bar{x}^{(1)} = (5.936, \ 2.770, \ 4.260, \ 1.326)'$,


(19)    $\bar{x}^{(2)} = (5.006, \ 3.428, \ 1.462, \ 0.246)'$,


              19.1434    9.0356    9.7634    3.2394
(20)    98S =  9.0356   11.8658    4.6232    2.4746
               9.7634    4.6232   12.2978    3.8794
               3.2394    2.4746    3.8794    2.4604


The value of $T^2/98$ is 26.334, and $(T^2/98) \times \frac{95}{4} = 625.5$. This value is highly
significant compared to the $F$-value for 4 and 95 degrees of freedom of 3.52
at the 0.01 significance level.

Simultaneous confidence intervals for the differences of component means
$\mu_i^{(1)} - \mu_i^{(2)}$, $i = 1, 2, 3, 4$, are $0.930 \pm 0.337$, $-0.658 \pm 0.265$, $2.798 \pm 0.270$,
and $1.080 \pm 0.121$. In each case 0 does not lie in the interval. [Since $t_{98}(0.01) <
T_{4,98}(0.01)$, a univariate test on any component would lead to rejection of the
null hypothesis.] The last two components show the most significant differences
from 0.
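
The figures quoted in this example can be recomputed from the summary statistics (18)-(20). The following plain-Python sketch evaluates $T^2/98$ from (14), converts it to the $F$ scale, and forms the half-widths of the simultaneous intervals (17) for the four component differences. The helper `solve` is a generic Gaussian-elimination routine written for this illustration, and the critical value 3.52 is the one quoted in the text; nothing else is assumed.

```python
import math

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    z = [0.0] * n
    for k in range(n - 1, -1, -1):
        z[k] = (m[k][n] - sum(m[k][c] * z[c] for c in range(k + 1, n))) / m[k][k]
    return z

N1 = N2 = 50
p = 4
n = N1 + N2 - 2                               # 98 degrees of freedom
xbar1 = [5.936, 2.770, 4.260, 1.326]          # Iris versicolor, (18)
xbar2 = [5.006, 3.428, 1.462, 0.246]          # Iris setosa, (19)
nS = [[19.1434, 9.0356, 9.7634, 3.2394],      # 98 S, (20)
      [9.0356, 11.8658, 4.6232, 2.4746],
      [9.7634, 4.6232, 12.2978, 3.8794],
      [3.2394, 2.4746, 3.8794, 2.4604]]
F_CRIT = 3.52                                 # F_{4,95}(0.01) as quoted in the text

d = [a - b for a, b in zip(xbar1, xbar2)]     # differences of sample means
z = solve(nS, d)                              # (98 S)^{-1} d
t2_over_n = N1 * N2 / (N1 + N2) * sum(di * zi for di, zi in zip(d, z))
f_stat = t2_over_n * (n - p + 1) / p          # (T^2/98) * 95/4

t_crit = math.sqrt(n * p / (n - p + 1) * F_CRIT)   # T_{4,98}(0.01)
half_widths = [math.sqrt(nS[i][i] / n) * math.sqrt((N1 + N2) / (N1 * N2)) * t_crit
               for i in range(p)]

print(t2_over_n, f_stat)
for di, hw in zip(d, half_widths):
    print(round(di, 3), "+/-", round(hw, 3))
```

Each interval uses $\gamma$ equal to a unit coordinate vector, so $\gamma' S \gamma$ is just the corresponding diagonal element of $S$.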


5.3.5. A Problem of Several Samples 


After considering the above example, Fisher considers a third sample drawn
from a population assumed to have the same covariance matrix. He treats the
same measurements on 50 Iris virginica (Table 3.4). There is a theoretical
reason for believing the gene structures of these three species to be such that
the mean vectors of the three populations are related as


(21)    $3\mu^{(1)} = \mu^{(3)} + 2\mu^{(2)}$,


where $\mu^{(3)}$ is the mean vector of the third population.

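This is a special case of the following general problem. Let $\{x_\alpha^{(i)}\}$, $\alpha =
1, \dots, N_i$, $i = 1, \dots, q$, be samples from $N(\mu^{(i)}, \Sigma)$, $i = 1, \dots, q$, respectively.
Let us test the hypothesis


(22)    $H\colon \sum_{i=1}^{q} \beta_i \mu^{(i)} = \mu$,

where $\beta_1, \dots, \beta_q$ are given scalars and $\mu$ is a given vector. The criterion is

(23)    $T^2 = c \left( \sum_{i=1}^{q} \beta_i \bar{x}^{(i)} - \mu \right)' S^{-1} \left( \sum_{i=1}^{q} \beta_i \bar{x}^{(i)} - \mu \right)$,


where


(24)    $\bar{x}^{(i)} = \dfrac{1}{N_i} \sum_{\alpha=1}^{N_i} x_\alpha^{(i)}$,

(25)    $\dfrac{1}{c} = \sum_{i=1}^{q} \dfrac{\beta_i^2}{N_i}$,

(26)    $\left( \sum_{i=1}^{q} N_i - q \right) S = \sum_{i=1}^{q} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'$.


This $T^2$ has the $T^2$-distribution with $\sum_{i=1}^{q} N_i - q$ degrees of freedom.

Fisher actually assumes in his example that the covariance matrices of the
three populations may be different. Hence he uses the technique described in
Section 5.5.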


5.3.6. A Problem of Symmetry 


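Consider testing the hypothesis $H\colon \mu_1 = \mu_2 = \cdots = \mu_p$ on the basis of a
sample $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, where $\mu' = (\mu_1, \dots, \mu_p)$. Let $C$ be any
$(p - 1) \times p$ matrix of rank $p - 1$ such that


(27)    $C\varepsilon = 0$,

where $\varepsilon' = (1, \dots, 1)$. Then

(28)    $y_\alpha = C x_\alpha, \qquad \alpha = 1, \dots, N$,


has mean $C\mu$ and covariance matrix $C \Sigma C'$. The hypothesis $H$ is $C\mu = 0$.
The statistic to be used is


(29)    $T^2 = N \bar{y}' S^{-1} \bar{y}$,

where

(30)    $\bar{y} = \dfrac{1}{N} \sum_{\alpha=1}^{N} y_\alpha = C\bar{x}$,


(31)    $S = \dfrac{1}{N - 1} \sum_{\alpha=1}^{N} (y_\alpha - \bar{y})(y_\alpha - \bar{y})' = \dfrac{1}{N - 1}\, C \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'\, C'$.


This statistic has the $T^2$-distribution with $N - 1$ degrees of freedom for a
$(p - 1)$-dimensional distribution. This $T^2$-statistic is invariant under any
linear transformation in the $p - 1$ dimensions orthogonal to $\varepsilon$. Hence the
statistic is independent of the choice of $C$.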

An example of this sort has been given by Rao (1948b). Let $N$ be the
amount of cork in a boring from the north into a cork tree; let $E$, $S$, and $W$
be defined similarly. The set of amounts in four borings on one tree is
considered as an observation from a 4-variate normal distribution. The
question is whether the cork trees have the same amount of cork on each
side. We make a transformation


        $y_1 = N - E - W + S$,
(32)    $y_2 = S - W$,
        $y_3 = N - S$.

The number of observations is 28. The vector of means is


(33)    $\bar{y} = (8.86, \ 4.50, \ 0.86)'$;


the covariance matrix for $y$ is


            128.72    61.41   -21.02
(34)    S =  61.41    56.93   -28.30
            -21.02   -28.30    63.53

The value of $T^2/(N - 1)$ is 0.768. The statistic $0.768 \times 25/3 = 6.402$ is to be
compared with the $F$-significance point with 3 and 25 degrees of freedom. It
is significant at the 1% level.
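
The transformation (32) can be written as a $3 \times 4$ matrix $C$ acting on $(N, E, S, W)'$, and condition (27) then says that every row of $C$ sums to zero. The plain-Python sketch below checks this and repeats the conversion of $T^2/(N - 1)$ to the $F$ scale. The variable ordering, the single "tree" observation, and the names are assumptions made for illustration; only 0.768, the sample size 28, and the contrasts themselves come from the text.

```python
# Sketch: contrast matrix for the symmetry hypothesis in the cork example.
# Variables are ordered (N, E, S, W); the tree values are hypothetical.

C = [
    [1, -1, 1, -1],   # y1 = N - E - W + S
    [0, 0, 1, -1],    # y2 = S - W
    [1, 0, -1, 0],    # y3 = N - S
]
row_sums = [sum(row) for row in C]        # all zero means C * epsilon = 0, as in (27)

def contrasts(x):
    """Apply C to one observation x = (N, E, S, W)."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in C]

tree = (70, 65, 75, 72)                   # hypothetical borings (N, E, S, W)
y = contrasts(tree)

n_obs, q = 28, 3                          # 28 trees, q = p - 1 = 3 contrasts
t2_over_n = 0.768                         # T^2/(N - 1) as reported in the text
f_stat = t2_over_n * (n_obs - q) / q      # compare with F on 3 and 25 d.f.
print(row_sums, y, f_stat)
```

The product printed here is 6.4 rather than 6.402 because it starts from the rounded value 0.768.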


5.3.7. Improved Estimation of the Mean 


In Section 3.5 we considered estimation of the mean when the covariance
matrix was known and showed that the Stein-type estimation based on this
knowledge yielded lower quadratic risks than did the sample mean. In
particular, if the loss is $(m - \mu)' \Sigma^{-1} (m - \mu)$, then


(35)    $\left( 1 - \dfrac{p - 2}{N(\bar{x} - \nu)' \Sigma^{-1} (\bar{x} - \nu)} \right)^{+} (\bar{x} - \nu) + \nu$

is a minimax estimator of $\mu$ for any $\nu$ and has a smaller risk than $\bar{x}$ when
$p \ge 3$. When $\Sigma$ is unknown, we consider replacing it by an estimator, namely,
a multiple of $A = nS$.


Theorem 5.3.1. When the loss is $(m - \mu)' \Sigma^{-1} (m - \mu)$, the estimator for
$p \ge 3$ given by


(36)    $\left( 1 - \dfrac{a}{N(\bar{x} - \nu)' A^{-1} (\bar{x} - \nu)} \right) (\bar{x} - \nu) + \nu$


has smaller risk than $\bar{x}$ and is minimax for $0 < a < 2(p - 2)/(n - p + 3)$, and
the risk is minimized for $a = (p - 2)/(n - p + 3)$.
Proof. As in the case when $\Sigma$ is known (Section 3.5.2), we can make a
transformation that carries $(1/N)\Sigma$ to $I$. Then the problem is to estimate $\mu$
based on $Y$ with the distribution $N(\mu, I)$ and $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where
$Z_1, \dots, Z_n$ are independently distributed, each according to $N(0, I)$, and the
loss function is $(m - \mu)'(m - \mu)$. (We have dropped a factor of $N$.) The
difference in risks is


(37)    $\Delta R(\mu) = \mathcal{E}_\mu \left\{ \|Y - \mu\|^2 - \left\| \left( 1 - \dfrac{a}{(Y - \nu)' A^{-1} (Y - \nu)} \right)(Y - \nu) + \nu - \mu \right\|^2 \right\}$

        $= \mathcal{E}_\mu \left\{ \dfrac{2a}{(Y - \nu)' A^{-1} (Y - \nu)} \sum_{i=1}^{p} (Y_i - \mu_i)(Y_i - \nu_i) - \dfrac{a^2}{[(Y - \nu)' A^{-1} (Y - \nu)]^2}\, \|Y - \nu\|^2 \right\}$.


The proof of Theorem 5.2.2 shows that $(Y - \nu)' A^{-1} (Y - \nu)$ is distributed as
$\|Y - \nu\|^2/\chi_{n-p+1}^2$, where the $\chi_{n-p+1}^2$ is independent of $Y$. Then the difference
in risks is


(38)    $\Delta R(\mu) = \mathcal{E}_\mu \left\{ \dfrac{2a \chi_{n-p+1}^2}{\|Y - \nu\|^2} \sum_{i=1}^{p} (Y_i - \mu_i)(Y_i - \nu_i) - \dfrac{a^2 (\chi_{n-p+1}^2)^2}{\|Y - \nu\|^2} \right\}$

        $= \left\{ 2(p - 2)(n - p + 1)a - [2(n - p + 1) + (n - p + 1)^2] a^2 \right\} \mathcal{E}_\mu \dfrac{1}{\|Y - \nu\|^2}$.


The factor in braces is $n - p + 1$ times $2(p - 2)a - (n - p + 3)a^2$, which
is positive for $0 < a < 2(p - 2)/(n - p + 3)$ and is maximized for $a =
(p - 2)/(n - p + 3)$.  ∎


The improvement over the risk of $Y$ is $(n - p + 1)(p - 2)^2/(n - p + 3) \cdot
\mathcal{E}_\mu \|Y - \nu\|^{-2}$, as compared to the improvement $(p - 2)^2 \mathcal{E}_\mu \|Y - \nu\|^{-2}$ of $m(y)$
of Section 3.5 when $\Sigma$ is known.
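
As an illustration of how the estimator (36) is computed, the plain-Python sketch below shrinks a sample mean toward a fixed vector $\nu$ using $A = nS$ with $n = N - 1$ and the default $a = (p - 2)/(n - p + 3)$. The function names, the `positive_part` option (which truncates the shrinkage factor at zero), and all numerical inputs are assumptions made for this sketch.

```python
# Sketch: the shrinkage estimator (36) with A = nS, n = N - 1.
# Inputs and names are illustrative; quad must be nonzero (xbar != nu).

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    z = [0.0] * n
    for k in range(n - 1, -1, -1):
        z[k] = (m[k][n] - sum(m[k][c] * z[c] for c in range(k + 1, n))) / m[k][k]
    return z

def shrinkage_estimate(xbar, nu, a_mat, n_obs, a=None, positive_part=False):
    """(1 - a / [N (xbar - nu)' A^{-1} (xbar - nu)]) (xbar - nu) + nu."""
    p = len(xbar)
    n = n_obs - 1
    if a is None:
        a = (p - 2) / (n - p + 3)         # the risk-minimizing choice for (36)
    d = [x - v for x, v in zip(xbar, nu)]
    quad = n_obs * sum(di * zi for di, zi in zip(d, solve(a_mat, d)))
    shrink = 1.0 - a / quad
    if positive_part:
        shrink = max(shrink, 0.0)         # truncate the factor at zero
    return [v + shrink * di for v, di in zip(nu, d)]

A = [[9.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 9.0]]   # A = nS with S = I, n = 9
nu = [0.0, 0.0, 0.0]
est = shrinkage_estimate([1.0, 2.0, 2.0], nu, A, 10)
est_pp = shrinkage_estimate([0.01, 0.0, 0.0], nu, A, 10, positive_part=True)
print(est, est_pp)
```

In the first call the quadratic form equals 10 and $a = 1/9$, so the mean is multiplied by $1 - 1/90$; in the second call the untruncated factor would be negative, and the truncated version returns $\nu$ itself.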
Corollary 5.3.1. The estimator for $p \ge 3$


(39)    $\left( 1 - \dfrac{a}{N(\bar{x} - \nu)' A^{-1} (\bar{x} - \nu)} \right)^{+} (\bar{x} - \nu) + \nu$


has smaller risk than (36) and is minimax for $0 < a < 2(p - 2)/(n - p + 3)$.


Proof. This corollary follows from Theorem 5.3.1 and Lemma 3.5.2.  ∎


The risk of (39) is not necessarily minimized at $a = (p - 2)/(n - p + 3)$,
but that value seems like a good choice. This is the estimator (18) of Section
3.5 with $\Sigma$ replaced by $[1/(n - p + 3)]A$.

When the loss function is $(m - \mu)' Q (m - \mu)$, where $Q$ is an arbitrary
positive definite matrix, it is harder to present a uniformly improved estimator
that is attractive. The estimators of Section 3.5 can be used with $\Sigma$
replaced by an estimate.


5.4. THE DISTRIBUTION OF $T^2$ UNDER ALTERNATIVE
HYPOTHESES; THE POWER FUNCTION


In Section 5.2.2 we showed that $(T^2/n)(N-p)/p$ has a noncentral $F$-distribution. In this section we shall discuss the noncentral $F$-distribution, its tabulation, and applications to procedures based on $T^2$.

The noncentral $F$-distribution is defined as the distribution of the ratio of a noncentral $\chi^2$ and an independent $\chi^2$ divided by the ratio of corresponding degrees of freedom. Let $V$ have the noncentral $\chi^2$-distribution with $p$ degrees of freedom and noncentrality parameter $\tau^2$ (as given in Theorem 3.3.5), and let $W$ be independently distributed as $\chi^2$ with $m$ degrees of freedom. We shall find the density of $F = (V/p)/(W/m)$, which is the noncentral $F$ with noncentrality parameter $\tau^2$. The joint density of $V$ and $W$ is (28) of Section 3.3 multiplied by the density of $W$, which is $2^{-m/2}\,\Gamma^{-1}(\tfrac12 m)\,w^{m/2-1}e^{-w/2}$. The joint density of $F$ and $W$ ($dv = pw\,df/m$) is

$$\frac{p}{m}\,\frac{e^{-\tau^2/2}}{2^{(p+m)/2}\,\Gamma(\tfrac12 m)}\, w^{(p+m)/2-1}\, e^{-(w/2)(1+pf/m)} \sum_{\beta=0}^{\infty}\frac{(\tau^2/4)^{\beta}}{\beta!\,\Gamma(\tfrac12 p+\beta)}\left(\frac{pf}{m}\right)^{p/2+\beta-1} w^{\beta}. \tag{1}$$

The marginal density, obtained by integrating (1) with respect to $w$ from 0 to $\infty$, is

$$\frac{p\,e^{-\tau^2/2}}{m\,\Gamma(\tfrac12 m)}\sum_{\beta=0}^{\infty}\frac{(\tau^2/2)^{\beta}\,(pf/m)^{p/2+\beta-1}\,\Gamma[\tfrac12(p+m)+\beta]}{\beta!\,\Gamma(\tfrac12 p+\beta)\,(1+pf/m)^{(p+m)/2+\beta}}. \tag{2}$$

Theorem 5.4.1. If $V$ has a noncentral $\chi^2$-distribution with $p$ degrees of freedom and noncentrality parameter $\tau^2$, and $W$ has an independent $\chi^2$-distribution with $m$ degrees of freedom, then $F = (V/p)/(W/m)$ has the density (2).


The density (2) is the density of the noncentral $F$-distribution.
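The series (2) can be evaluated numerically by truncating the sum. The sketch below is one way to do this with the standard library only; the function name `noncentral_f_density`, the truncation length `terms`, and the use of log-gamma arithmetic are choices made for this illustration. For $p = m = 2$ the series sums in closed form to $e^{-\tau^2/2}(1+a)e^{a}/(1+f)^2$ with $a = (\tau^2/2)f/(1+f)$, which gives a convenient check.

```python
import math


def noncentral_f_density(f, p, m, tau2, terms=200):
    """Evaluate the series (2) at f > 0, truncated after `terms` terms."""
    x = p * f / m
    total = 0.0
    for beta in range(terms):
        if tau2 == 0.0 and beta > 0:
            break  # only the beta = 0 term survives in the central case
        # Work on the log scale to avoid overflow in the gamma functions.
        log_term = (
            math.lgamma(0.5 * (p + m) + beta)
            - math.lgamma(beta + 1)
            - math.lgamma(0.5 * p + beta)
            + (0.5 * p + beta - 1) * math.log(x)
            - (0.5 * (p + m) + beta) * math.log1p(x)
        )
        if tau2 > 0.0:
            log_term += beta * math.log(tau2 / 2.0)
        total += math.exp(log_term)
    return p * math.exp(-0.5 * tau2 - math.lgamma(0.5 * m)) / m * total


central = noncentral_f_density(1.0, 2, 2, 0.0)      # central F with 2 and 2 d.f. at f = 1
noncentral = noncentral_f_density(1.0, 2, 2, 2.0)   # same point with tau^2 = 2
```

With $\tau^2 = 0$ the function reduces to the ordinary $F$ density, here $1/(1+f)^2$ for 2 and 2 degrees of freedom.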

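If $T^2 = N(\bar{x}-\mu_0)'S^{-1}(\bar{x}-\mu_0)$ is based on a sample of $N$ from $N(\mu, \Sigma)$, then $(T^2/n)(N-p)/p$ has the noncentral $F$-distribution with $p$ and $N-p$ degrees of freedom and noncentrality parameter $N(\mu-\mu_0)'\Sigma^{-1}(\mu-\mu_0) = \tau^2$. From (2) we find that the density of $T^2$ is

$$\frac{e^{-\tau^2/2}}{(N-1)\,\Gamma[\tfrac12(N-p)]}\sum_{\beta=0}^{\infty}\frac{(\tau^2/2)^{\beta}\,[t^2/(N-1)]^{p/2+\beta-1}\,\Gamma(\tfrac12 N+\beta)}{\beta!\,\Gamma(\tfrac12 p+\beta)\,[1+t^2/(N-1)]^{N/2+\beta}}$$
$$= \frac{e^{-\tau^2/2}\,(t^2/n)^{p/2-1}\,(1+t^2/n)^{-N/2}\,\Gamma(\tfrac12 N)}{n\,\Gamma(\tfrac12 p)\,\Gamma[\tfrac12(N-p)]}\;{}_1F_1\!\left(\tfrac12 N;\ \tfrac12 p;\ \frac{\tfrac12\tau^2\,t^2/n}{1+t^2/n}\right), \tag{3}$$

where

$${}_1F_1(a; b; x) = \sum_{\beta=0}^{\infty}\frac{\Gamma(a+\beta)\,\Gamma(b)\,x^{\beta}}{\Gamma(a)\,\Gamma(b+\beta)\,\beta!}. \tag{4}$$

The density (3) is the density of the noncentral $T^2$-distribution.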

Tables have been given by Tang (1938) of the probability of accepting the null hypothesis (that is, the probability of Type II error) for various values of $\tau^2$ and for significance levels 0.05 and 0.01. His number of degrees of freedom $f_1$ is our $p$ [1(1)8], his $f_2$ is our $n-p+1$ [2, 4(1)30, 60, $\infty$], and his noncentrality parameter $\phi$ is related to our $\tau^2$ by

$$\phi = \frac{\tau}{\sqrt{p+1}} \tag{5}$$

[1(½)3(1)8]. His accompanying tables of significance points are for $T^2/(T^2+N-1)$.

As an example, suppose $p = 4$, $n-p+1 = 20$, and consider testing the null hypothesis $\mu = 0$ at the 1% level of significance. We would like to know the probability, say, that we accept the null hypothesis when $\phi = 2.5$ ($\tau^2 = 31.25$). It is 0.227. If we think the disadvantage of accepting the null hypothesis when $N$, $\mu$, and $\Sigma$ are such that $\tau^2 = 31.25$ is less than the disadvantage of rejecting the null hypothesis when it is true, then we may find it reasonable to conduct the test as assumed. However, if the disadvantage of one type of error is about equal to that of the other, it would seem reasonable to bring down the probability of a Type II error. Thus, if we use a significance level of 5%, the probability of Type II error (for $\phi = 2.5$) is only 0.043.

Lehmer (1944) has computed tables of $\phi$ for given significance level and given probability of Type II error. Here tables can be used to see what value of $\tau^2$ is needed to make the probability of acceptance of the null hypothesis sufficiently low when $\mu \ne 0$. For instance, if we want to be able to reject the hypothesis $\mu = 0$ on the basis of a sample for a given $\mu$ and $\Sigma$, we may be able to choose $N$ so that $N\mu'\Sigma^{-1}\mu = \tau^2$ is sufficiently large. Of course, the difficulty with these considerations is that we usually do not know exactly the values of $\mu$ and $\Sigma$ (and hence of $\tau^2$) for which we want the probability of rejection at a certain value.
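The arithmetic behind these table lookups is simple enough to sketch. The snippet below inverts relation (5) to recover $\tau^2$ from Tang's $\phi$, and then finds the smallest $N$ for which $N\mu'\Sigma^{-1}\mu$ reaches a target $\tau^2$. The function names and the particular $\mu$ and $\Sigma = I$ are hypothetical; in practice the required $\phi$ itself depends on the degrees of freedom $n-p+1$, so the choice of $N$ would have to be iterated against the tables.

```python
import math


def tau2_from_phi(phi, p):
    """Invert (5): phi = tau / sqrt(p + 1), hence tau^2 = phi^2 * (p + 1)."""
    return phi * phi * (p + 1)


def quadratic_form(mu, sigma_inv):
    """Return mu' Sigma^{-1} mu for a list mu and a nested-list inverse covariance."""
    k = len(mu)
    return sum(mu[i] * sigma_inv[i][j] * mu[j] for i in range(k) for j in range(k))


def smallest_n(target_tau2, mu, sigma_inv):
    """Smallest integer N with N * mu' Sigma^{-1} mu >= target_tau2."""
    return math.ceil(target_tau2 / quadratic_form(mu, sigma_inv))


tau2 = tau2_from_phi(2.5, 4)  # the example in the text: phi = 2.5, p = 4
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Hypothetical alternative: every component of mu equals 0.5 and Sigma = I.
n_needed = smallest_n(tau2, [0.5, 0.5, 0.5, 0.5], identity)
```

Here `tau2` reproduces the value 31.25 quoted in the example.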

The distribution of T? when the null hypothesis is not true was derived by 
different methods by Hsu (1938) and Bose and Roy (1938). 


5.5. THE TWO-SAMPLE PROBLEM WITH UNEQUAL 
COVARIANCE MATRICES 


If the covariance matrices are not the same, the $T^2$-test for equality of mean vectors has a probability of rejection under the null hypothesis that depends on these matrices. If the difference between the matrices is small or if the sample sizes are large, there is no practical effect. However, if the covariance matrices are quite different and/or the sample sizes are relatively small, the nominal significance level may be distorted. Hence we develop a procedure with assigned significance level. Let $\{x_\alpha^{(i)}\}$, $\alpha = 1,\ldots,N_i$, be samples from $N(\mu^{(i)}, \Sigma_i)$, $i = 1,2$. We wish to test the hypothesis $H: \mu^{(1)} = \mu^{(2)}$. The mean $\bar{x}^{(1)}$ of the first sample is normally distributed with expected value

$$\mathscr{E}\bar{x}^{(1)} = \mu^{(1)} \tag{1}$$

and covariance matrix

$$\mathscr{E}(\bar{x}^{(1)}-\mu^{(1)})(\bar{x}^{(1)}-\mu^{(1)})' = \frac{1}{N_1}\Sigma_1. \tag{2}$$

Similarly, the mean $\bar{x}^{(2)}$ of the second sample is normally distributed with expected value

$$\mathscr{E}\bar{x}^{(2)} = \mu^{(2)} \tag{3}$$

and covariance matrix

$$\mathscr{E}(\bar{x}^{(2)}-\mu^{(2)})(\bar{x}^{(2)}-\mu^{(2)})' = \frac{1}{N_2}\Sigma_2. \tag{4}$$

Thus $\bar{x}^{(1)}-\bar{x}^{(2)}$ has mean $\mu^{(1)}-\mu^{(2)}$ and covariance matrix $(1/N_1)\Sigma_1 + (1/N_2)\Sigma_2$. We cannot use the technique of Section 5.2, however, because

$$\sum_{\alpha=1}^{N_1}(x_\alpha^{(1)}-\bar{x}^{(1)})(x_\alpha^{(1)}-\bar{x}^{(1)})' + \sum_{\alpha=1}^{N_2}(x_\alpha^{(2)}-\bar{x}^{(2)})(x_\alpha^{(2)}-\bar{x}^{(2)})' \tag{5}$$

does not have the Wishart distribution with covariance matrix a multiple of $(1/N_1)\Sigma_1 + (1/N_2)\Sigma_2$.

If $N_1 = N_2 = N$, say, we can use the $T^2$-test in an obvious way. Let $y_\alpha = x_\alpha^{(1)} - x_\alpha^{(2)}$ (assuming the numbering of the observations in the two samples is independent of the observations themselves). Then $y_\alpha$ is normally distributed with mean $\mu^{(1)}-\mu^{(2)}$ and covariance matrix $\Sigma_1+\Sigma_2$, and $y_1,\ldots,y_N$ are independent. Let $\bar{y} = (1/N)\sum_{\alpha=1}^{N} y_\alpha = \bar{x}^{(1)}-\bar{x}^{(2)}$, and define $S$ by

$$(N-1)S = \sum_{\alpha=1}^{N}(y_\alpha-\bar{y})(y_\alpha-\bar{y})' = \sum_{\alpha=1}^{N}(x_\alpha^{(1)}-x_\alpha^{(2)}-\bar{x}^{(1)}+\bar{x}^{(2)})(x_\alpha^{(1)}-x_\alpha^{(2)}-\bar{x}^{(1)}+\bar{x}^{(2)})'. \tag{6}$$

Then

$$T^2 = N\bar{y}'S^{-1}\bar{y} \tag{7}$$

is suitable for testing the hypothesis $\mu^{(1)}-\mu^{(2)} = 0$, and has the $T^2$-distribution with $N-1$ degrees of freedom. It should be observed that if we had known $\Sigma_1 = \Sigma_2$, we would have used a $T^2$-statistic with $2N-2$ degrees of freedom; thus we have lost $N-1$ degrees of freedom in constructing a test which is independent of the two covariance matrices. If $N_1 = N_2 = 50$ as in the example in Section 5.3.4, then $T^2_{4,49}(.01) = 15.93$ as compared to $T^2_{4,98}(.01)$
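For the equal-sample-size case, the statistic (7) can be computed directly from the paired differences. The following is a minimal sketch under the assumption that the pairing is arbitrary, as the text requires; the helper names `solve` and `paired_t2` and the toy data are illustrative only.

```python
def solve(M, b):
    """Solve M w = b by Gauss-Jordan elimination with partial pivoting."""
    k = len(b)
    aug = [list(M[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def paired_t2(x1, x2):
    """T^2 of (7) from differences y_a = x1_a - x2_a; both samples have size N."""
    N, p = len(x1), len(x1[0])
    y = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(x1, x2)]
    ybar = [sum(row[j] for row in y) / N for j in range(p)]
    # S as in (6): sample covariance of the differences, N - 1 degrees of freedom.
    S = [[sum((row[i] - ybar[i]) * (row[j] - ybar[j]) for row in y) / (N - 1)
          for j in range(p)] for i in range(p)]
    return N * sum(u * v for u, v in zip(ybar, solve(S, ybar)))


# Hypothetical samples with N = 4 and p = 2.
x1 = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
x2 = [[0.0, 1.0], [2.0, 2.0], [1.0, 3.0], [1.0, 0.0]]
t2 = paired_t2(x1, x2)
```

Under the null hypothesis, $[T^2/(N-1)][(N-p)/p]$ would then be referred to the $F$-distribution with $p$ and $N-p$ degrees of freedom.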


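Now let us turn our attention to the case of $N_1 \ne N_2$. For convenience, let $N_1 < N_2$. Then we define

$$y_\alpha = x_\alpha^{(1)} - \sqrt{\frac{N_1}{N_2}}\,x_\alpha^{(2)} + \frac{1}{\sqrt{N_1N_2}}\sum_{\beta=1}^{N_1}x_\beta^{(2)} - \frac{1}{N_2}\sum_{\gamma=1}^{N_2}x_\gamma^{(2)}, \qquad \alpha = 1,\ldots,N_1. \tag{8}$$

The expected value of $y_\alpha$ is

$$\mathscr{E}y_\alpha = \mu^{(1)} - \sqrt{\frac{N_1}{N_2}}\,\mu^{(2)} + \frac{N_1}{\sqrt{N_1N_2}}\,\mu^{(2)} - \frac{N_2}{N_2}\,\mu^{(2)} = \mu^{(1)}-\mu^{(2)}. \tag{9}$$

The covariance matrix of $y_\alpha$ and $y_\beta$ is

$$\mathscr{E}(y_\alpha-\mathscr{E}y_\alpha)(y_\beta-\mathscr{E}y_\beta)' = \delta_{\alpha\beta}\left(\Sigma_1 + \frac{N_1}{N_2}\Sigma_2\right). \tag{10}$$

Thus a suitable statistic for testing $\mu^{(1)}-\mu^{(2)} = 0$, which has the $T^2$-distribution with $N_1-1$ degrees of freedom, is

$$T^2 = N_1\bar{y}'S^{-1}\bar{y}, \tag{11}$$

where

$$\bar{y} = \frac{1}{N_1}\sum_{\alpha=1}^{N_1}y_\alpha = \bar{x}^{(1)}-\bar{x}^{(2)} \tag{12}$$

and

$$(N_1-1)S = \sum_{\alpha=1}^{N_1}(y_\alpha-\bar{y})(y_\alpha-\bar{y})' = \sum_{\alpha=1}^{N_1}(u_\alpha-\bar{u})(u_\alpha-\bar{u})', \tag{13}$$

where $\bar{u} = (1/N_1)\sum_{\alpha=1}^{N_1}u_\alpha$ and $u_\alpha = x_\alpha^{(1)} - \sqrt{N_1/N_2}\,x_\alpha^{(2)}$, $\alpha = 1,\ldots,N_1$.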

This procedure was suggested by Scheffé (1943) in the univariate case. Scheffé showed that in the univariate case this technique gives the shortest confidence intervals obtained by using the $t$-distribution. The advantage of the method is that $\bar{x}^{(1)}-\bar{x}^{(2)}$ is used, and this statistic is most relevant to $\mu^{(1)}-\mu^{(2)}$. The sacrifice of observations in estimating a covariance matrix is not so important. Bennett (1951) gave the extension of the procedure to the multivariate case.
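The construction for unequal sample sizes can be checked numerically. The sketch below builds the vectors $y_\alpha$ from two samples with $N_1 \le N_2$, together with the simpler vectors $u_\alpha = x_\alpha^{(1)} - \sqrt{N_1/N_2}\,x_\alpha^{(2)}$, and confirms on toy data that the mean of the $y_\alpha$ equals $\bar{x}^{(1)}-\bar{x}^{(2)}$ and that the two sums of squares and cross products agree. All function names and data are assumptions made for the illustration.

```python
import math


def scheffe_transform(x1, x2):
    """Return (y, u): the vectors y_a of the Scheffe-Bennett construction and the
    vectors u_a = x1_a - sqrt(N1/N2) x2_a, for a = 1, ..., N1 (requires N1 <= N2)."""
    n1, n2, p = len(x1), len(x2), len(x1[0])
    if n1 > n2:
        raise ValueError("label the samples so that N1 <= N2")
    r = math.sqrt(n1 / n2)
    c = 1.0 / math.sqrt(n1 * n2)
    head = [sum(x2[b][j] for b in range(n1)) for j in range(p)]  # sum of first N1 rows of x2
    mean2 = [sum(row[j] for row in x2) / n2 for j in range(p)]   # mean of all N2 rows of x2
    y = [[x1[a][j] - r * x2[a][j] + c * head[j] - mean2[j] for j in range(p)]
         for a in range(n1)]
    u = [[x1[a][j] - r * x2[a][j] for j in range(p)] for a in range(n1)]
    return y, u


def mean(rows):
    return [sum(row[j] for row in rows) / len(rows) for j in range(len(rows[0]))]


def scatter(rows):
    """Sum of squares and cross products about the mean."""
    m, p = mean(rows), len(rows[0])
    return [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in rows)
             for j in range(p)] for i in range(p)]


# Hypothetical samples: N1 = 3, N2 = 5, p = 2.
x1 = [[1.0, 2.0], [3.0, 1.0], [2.0, 4.0]]
x2 = [[0.0, 1.0], [2.0, 2.0], [1.0, 3.0], [1.0, 0.0], [4.0, 2.0]]
y, u = scheffe_transform(x1, x2)
ybar = mean(y)
diff_of_means = [a - b for a, b in zip(mean(x1), mean(x2))]
```

The agreement of `scatter(y)` and `scatter(u)` reflects the fact that $y_\alpha - u_\alpha$ does not depend on $\alpha$.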

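This approach can be used for more general cases. Let $\{x_\alpha^{(i)}\}$, $\alpha = 1,\ldots,N_i$, $i = 1,\ldots,q$, be samples from $N(\mu^{(i)}, \Sigma_i)$, $i = 1,\ldots,q$, respectively. Consider testing the hypothesis

$$H: \sum_{i=1}^{q}\beta_i\mu^{(i)} = \mu, \tag{14}$$

where $\beta_1,\ldots,\beta_q$ are given scalars and $\mu$ is a given vector. If the $N_i$ are unequal, take $N_1$ to be the smallest. Let

$$y_\alpha = \beta_1x_\alpha^{(1)} + \sum_{i=2}^{q}\beta_i\sqrt{\frac{N_1}{N_i}}\left(x_\alpha^{(i)} - \frac{1}{N_1}\sum_{\beta=1}^{N_1}x_\beta^{(i)} + \frac{1}{\sqrt{N_1N_i}}\sum_{\gamma=1}^{N_i}x_\gamma^{(i)}\right). \tag{15}$$

Then $\mathscr{E}y_\alpha = \sum_{i=1}^{q}\beta_i\mu^{(i)}$, and

$$\mathscr{E}(y_\alpha-\mathscr{E}y_\alpha)(y_\beta-\mathscr{E}y_\beta)' = \delta_{\alpha\beta}\sum_{i=1}^{q}\frac{\beta_i^2N_1}{N_i}\Sigma_i. \tag{16}$$

Let $\bar{y}$ and $S$ be defined by

$$\bar{y} = \frac{1}{N_1}\sum_{\alpha=1}^{N_1}y_\alpha = \sum_{i=1}^{q}\beta_i\bar{x}^{(i)}, \qquad \bar{x}^{(i)} = \frac{1}{N_i}\sum_{\beta=1}^{N_i}x_\beta^{(i)}, \tag{17}$$

$$(N_1-1)S = \sum_{\alpha=1}^{N_1}(y_\alpha-\bar{y})(y_\alpha-\bar{y})'. \tag{18}$$

Then

$$T^2 = N_1(\bar{y}-\mu)'S^{-1}(\bar{y}-\mu) \tag{19}$$

is suitable for testing $H$, and when the hypothesis is true, this statistic has the $T^2$-distribution for dimension $p$ with $N_1-1$ degrees of freedom. If we let $u_\alpha = \sum_{i=1}^{q}\beta_i\sqrt{N_1/N_i}\,x_\alpha^{(i)}$, $\alpha = 1,\ldots,N_1$, then $S$ can be defined as

$$(N_1-1)S = \sum_{\alpha=1}^{N_1}(u_\alpha-\bar{u})(u_\alpha-\bar{u})'. \tag{20}$$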


Another problem that is amenable to this kind of treatment is testing the hypothesis that two subvectors have equal means. Let $x = (x^{(1)\prime}, x^{(2)\prime})'$ be distributed normally with mean $\mu = (\mu^{(1)\prime}, \mu^{(2)\prime})'$ and covariance matrix

$$\Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}. \tag{21}$$

We assume that $x^{(1)}$ and $x^{(2)}$ are each of $q$ components. Then $y = x^{(1)}-x^{(2)}$ is distributed normally with mean $\mu^{(1)}-\mu^{(2)}$ and covariance matrix $\Sigma_y = \Sigma_{11}-\Sigma_{21}-\Sigma_{12}+\Sigma_{22}$. To test the hypothesis $\mu^{(1)} = \mu^{(2)}$ we use a $T^2$-statistic $N\bar{y}'S_y^{-1}\bar{y}$, where the mean vector and covariance matrix of the sample are partitioned similarly to $\mu$ and $\Sigma$.


5.6. SOME OPTIMAL PROPERTIES OF THE $T^2$-TEST


5.6.1. Optimal Invariant Tests 


In this section we shall indicate that the $T^2$-test is the best in certain classes of tests and sketch briefly the proofs of these results.

The hypothesis $\mu = 0$ is to be tested on the basis of the $N$ observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$. First we consider the class of tests based on the statistics $A = \sum(x_\alpha-\bar{x})(x_\alpha-\bar{x})'$ and $\bar{x}$ which are invariant with respect to the transformations $A^* = CAC'$ and $\bar{x}^* = C\bar{x}$, where $C$ is nonsingular. The transformation $x_\alpha^* = Cx_\alpha$ leaves the problem invariant; that is, in terms of $x_\alpha^*$ we test the hypothesis $\mathscr{E}x_\alpha^* = 0$ given that $x_1^*,\ldots,x_N^*$ are $N$ observations from a multivariate normal population. It seems reasonable that we require a solution that is also invariant with respect to these transformations; that is, we look for a critical region that is not changed by a nonsingular linear transformation. (The definition of the region is the same in different coordinate systems.)


Theorem 5.6.1. Given the observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, of all tests of $\mu = 0$ based on $\bar{x}$ and $A = \sum(x_\alpha-\bar{x})(x_\alpha-\bar{x})'$ that are invariant with respect to transformations $\bar{x}^* = C\bar{x}$, $A^* = CAC'$ ($C$ nonsingular), the $T^2$-test is uniformly most powerful.


Proof. First, as we have seen in Section 5.2.1, any test based on $T^2$ is invariant. Second, this function is essentially the only invariant, for if $f(\bar{x}, A)$ is invariant, then $f(\bar{x}, A) = f(\bar{x}^*, I)$, where only the first coordinate of $\bar{x}^*$ is different from zero and it is $\sqrt{\bar{x}'A^{-1}\bar{x}}$. (There is a matrix $C$ such that $C\bar{x} = \bar{x}^*$ and $CAC' = I$.) Thus $f(\bar{x}, A)$ depends only on $\bar{x}'A^{-1}\bar{x}$. Thus an invariant test must be based on $\bar{x}'A^{-1}\bar{x}$. Third, we can apply the Neyman-Pearson fundamental lemma to the distribution of $T^2$ [(3) of Section 5.4] to find the uniformly most powerful test based on $T^2$ against a simple alternative $\tau^2 = N\mu'\Sigma^{-1}\mu$. The most powerful test of $\tau^2 = 0$ is based on the ratio of (3) of Section 5.4 to (3) with $\tau^2 = 0$. The critical region is

$$c \le \frac{e^{-\tau^2/2}\displaystyle\sum_{\alpha=0}^{\infty}\frac{(\tau^2/2)^{\alpha}\,(t^2/n)^{p/2+\alpha-1}\,\Gamma[\tfrac12(n+1)+\alpha]}{\alpha!\,\Gamma(\tfrac12 p+\alpha)\,(1+t^2/n)^{(n+1)/2+\alpha}}}{\dfrac{(t^2/n)^{p/2-1}\,\Gamma[\tfrac12(n+1)]}{\Gamma(\tfrac12 p)\,(1+t^2/n)^{(n+1)/2}}}$$
$$= \frac{\Gamma(\tfrac12 p)\,e^{-\tau^2/2}}{\Gamma[\tfrac12(n+1)]}\sum_{\alpha=0}^{\infty}\frac{(\tau^2/2)^{\alpha}\,\Gamma[\tfrac12(n+1)+\alpha]}{\alpha!\,\Gamma(\tfrac12 p+\alpha)}\left(\frac{t^2/n}{1+t^2/n}\right)^{\alpha}. \tag{1}$$

The right-hand side of (1) is a strictly increasing function of $(t^2/n)/(1+t^2/n)$, hence of $t^2$. Thus the inequality is equivalent to $t^2 > k$ for $k$ suitably chosen. Since this does not depend on the alternative $\tau^2$, the test is uniformly most powerful invariant. ∎

Definition 5.6.1. A critical function $\psi(\bar{x}, A)$ is a function with values between 0 and 1 (inclusive) such that $\mathscr{E}\psi(\bar{x}, A) = \varepsilon$, the significance level, when $\mu = 0$.


A randomized test consists of rejecting the hypothesis with probability $\psi(x, B)$ when $\bar{x} = x$ and $A = B$. A nonrandomized test is defined when $\psi(\bar{x}, A)$ takes on only the values 0 and 1. Using the form of the Neyman-Pearson lemma appropriate for critical functions, we obtain the following corollary:


Corollary 5.6.1. On the basis of observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, of all randomized tests based on $\bar{x}$ and $A$ that are invariant with respect to transformations $\bar{x}^* = C\bar{x}$, $A^* = CAC'$ ($C$ nonsingular), the $T^2$-test is uniformly most powerful.


Theorem 5.6.2. On the basis of observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, of all tests of $\mu = 0$ that are invariant with respect to transformations $x_\alpha^* = Cx_\alpha$ ($C$ nonsingular), the $T^2$-test is a uniformly most powerful test; that is, the $T^2$-test is at least as powerful as any other invariant test.


Proof. Let $\psi(x_1,\ldots,x_N)$ be the critical function of an invariant test. Then

$$\mathscr{E}\psi(x_1,\ldots,x_N) = \mathscr{E}\{\mathscr{E}[\psi(x_1,\ldots,x_N)\mid\bar{x}, A]\}. \tag{2}$$

Since $\bar{x}$, $A$ are sufficient statistics for $\mu$, $\Sigma$, the expectation $\mathscr{E}[\psi(x_1,\ldots,x_N)\mid\bar{x}, A]$ depends only on $\bar{x}$, $A$. It is invariant and has the same power as $\psi(x_1,\ldots,x_N)$. Thus each test in this larger class can be replaced by one in the smaller class (depending only on $\bar{x}$ and $A$) that has identical power. Corollary 5.6.1 completes the proof. ∎


Theorem 5.6.3. Given observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, of all tests of $\mu = 0$ based on $\bar{x}$ and $A = \sum(x_\alpha-\bar{x})(x_\alpha-\bar{x})'$ with power depending only on $N\mu'\Sigma^{-1}\mu$, the $T^2$-test is uniformly most powerful.


Proof. We wish to reduce this theorem to Theorem 5.6.1 by identifying the class of tests with power depending on $N\mu'\Sigma^{-1}\mu$ with the class of invariant tests. We need the following definition:


Definition 5.6.2. A test $\psi(x_1,\ldots,x_N)$ is said to be almost invariant if

$$\psi(x_1,\ldots,x_N) = \psi(Cx_1,\ldots,Cx_N) \tag{3}$$

for all $x_1,\ldots,x_N$ except for a set of $x_1,\ldots,x_N$ of Lebesgue measure zero; this exception set may depend on $C$.

It is clear that Theorems 5.6.1 and 5.6.2 hold if we extend the definition of invariant test to mean that (3) holds except for a fixed set of $x_1,\ldots,x_N$ of measure 0 (the set not depending on $C$). It has been shown by Hunt and Stein [Lehmann (1959)] that in our problem almost invariance implies invariance (in the broad sense).

Now we wish to argue that if $\psi(\bar{x}, A)$ has power depending only on $N\mu'\Sigma^{-1}\mu$, it is almost invariant. Since the power of $\psi(\bar{x}, A)$ depends only on $N\mu'\Sigma^{-1}\mu$, the power is

$$\mathscr{E}_{\mu,\Sigma}\,\psi(\bar{x}, A) = \mathscr{E}_{C\mu,\,C\Sigma C'}\,\psi(\bar{x}, A) = \mathscr{E}_{\mu,\Sigma}\,\psi(C\bar{x}, CAC'). \tag{4}$$

The second and third terms of (4) are merely different ways of writing the same integral. Thus

$$\mathscr{E}_{\mu,\Sigma}[\psi(\bar{x}, A) - \psi(C\bar{x}, CAC')] \equiv 0, \tag{5}$$

identically in $\mu$, $\Sigma$. Since $\bar{x}$, $A$ are a complete sufficient set of statistics for $\mu$, $\Sigma$ (Theorem 3.4.2), $f(\bar{x}, A) = \psi(\bar{x}, A) - \psi(C\bar{x}, CAC') = 0$ almost everywhere. Theorem 5.6.3 follows. ∎


As Theorem 5.6.2 follows from Theorem 5.6.1, so does the following theorem from Theorem 5.6.3:


Theorem 5.6.4. On the basis of observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, of all tests of $\mu = 0$ with power depending only on $N\mu'\Sigma^{-1}\mu$, the $T^2$-test is a uniformly most powerful test.


Theorem 5.6.4 was first proved by Simaika (1941). The results and proofs given in this section follow Lehmann (1959). Hsu (1945) has proved an optimal property of the $T^2$-test that involves averaging the power over $\mu$ and $\Sigma$.


5.6.2. Admissible Tests 


We now turn to the question of whether the $T^2$-test is a good test compared to all possible tests; the comparison in the previous section was to the restricted class of invariant tests. The main result is that the $T^2$-test is admissible in the class of all tests; that is, there is no other procedure that is better.


Definition 5.6.3. A test $T^*$ of the null hypothesis $H_0: \omega \in \Omega_0$ against the alternative $\omega \in \Omega_1$ (disjoint from $\Omega_0$) is admissible if there exists no other test $T$ such that

$$\Pr\{\text{Reject } H_0 \mid T, \omega\} \le \Pr\{\text{Reject } H_0 \mid T^*, \omega\}, \qquad \omega \in \Omega_0, \tag{6}$$

$$\Pr\{\text{Reject } H_0 \mid T, \omega\} \ge \Pr\{\text{Reject } H_0 \mid T^*, \omega\}, \qquad \omega \in \Omega_1, \tag{7}$$

with strict inequality for at least one $\omega$.


The admissibility of the $T^2$-test follows from a theorem of Stein (1956a) that applies to any exponential family of distributions.

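An exponential family of distributions $(\mathscr{Y}, \mathscr{B}, m, \Omega, P)$ consists of a finite-dimensional Euclidean space $\mathscr{Y}$, a measure $m$ on the $\sigma$-algebra $\mathscr{B}$ of all ordinary Borel sets of $\mathscr{Y}$, a subset $\Omega$ of the adjoint space $\mathscr{Y}'$ (the linear space of all real-valued linear functions on $\mathscr{Y}$) such that

$$\psi(\omega) = \int_{\mathscr{Y}} e^{\omega'y}\,dm(y) < \infty, \qquad \omega \in \Omega, \tag{8}$$

and $P$, the function on $\Omega$ to the set of probability measures on $\mathscr{B}$ given by

$$P_\omega(A) = \frac{1}{\psi(\omega)}\int_A e^{\omega'y}\,dm(y), \qquad A \in \mathscr{B}.$$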

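The family of normal distributions $N(\mu, \Sigma)$ constitutes an exponential family, for the density can be written

$$n(x \mid \mu, \Sigma) = \frac{e^{-\frac12\mu'\Sigma^{-1}\mu}}{(2\pi)^{p/2}|\Sigma|^{1/2}}\,e^{\mu'\Sigma^{-1}x - \frac12\operatorname{tr}\Sigma^{-1}xx'}. \tag{9}$$

We map from $\mathscr{X}$ to $\mathscr{Y}$; the vector $y = (y^{(1)\prime}, y^{(2)\prime})'$ is composed of $y^{(1)} = x$ and $y^{(2)} = (x_1^2, 2x_1x_2, \ldots, 2x_1x_p, x_2^2, \ldots, x_p^2)'$. The vector $\omega = (\omega^{(1)\prime}, \omega^{(2)\prime})'$ is composed of $\omega^{(1)} = \Sigma^{-1}\mu$ and $\omega^{(2)} = -\tfrac12(\sigma^{11}, \sigma^{12}, \ldots, \sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})'$, where $(\sigma^{ij}) = \Sigma^{-1}$; the transformation of parameters is one to one. The measure $m(A)$ of a set $A \in \mathscr{B}$ is the ordinary Lebesgue measure of the set of $x$ that maps into the set $A$. (Note that the probability measure in $\mathscr{Y}$ is not defined by a density.)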


Theorem 5.6.5 (Stein). Let $(\mathscr{Y}, \mathscr{B}, m, \Omega, P)$ be an exponential family and $\Omega_0$ a nonempty proper subset of $\Omega$. (i) Let $A$ be a subset of $\mathscr{Y}$ that is closed and convex. (ii) Suppose that for every vector $\omega \in \mathscr{Y}'$ and real $c$ for which $\{y \mid \omega'y > c\}$ and $A$ are disjoint, there exists $\omega_1 \in \Omega$ such that for arbitrarily large $\lambda$ the vector $\omega_1 + \lambda\omega \in \Omega - \Omega_0$. Then the test with acceptance region $A$ is admissible for testing the hypothesis that $\omega \in \Omega_0$ against the alternative $\omega \in \Omega - \Omega_0$.


Figure 5.2. (Figure omitted; the surviving label marks the half-space $\omega'y > c$.)


The conditions of the theorem are illustrated in Figure 5.2, which is drawn simultaneously in the space $\mathscr{Y}$ and the set $\Omega$.


Proof. The critical function of the test with acceptance region $A$ is $\phi_A(y) = 0$, $y \in A$, and $\phi_A(y) = 1$, $y \notin A$. Suppose $\phi(y)$ is the critical function of a better test, that is,

$$\int\phi(y)\,dP_\omega(y) \le \int\phi_A(y)\,dP_\omega(y), \qquad \omega \in \Omega_0, \tag{10}$$

$$\int\phi(y)\,dP_\omega(y) \ge \int\phi_A(y)\,dP_\omega(y), \qquad \omega \in \Omega - \Omega_0, \tag{11}$$

with strict inequality for some $\omega$; we shall show that this assumption leads to a contradiction. Let $B = \{y \mid \phi(y) < 1\}$. (If the competing test is nonrandomized, $B$ is its acceptance region.) Then

$$\{y \mid \phi_A(y) - \phi(y) > 0\} = \bar{A} \cap B, \tag{12}$$

where $\bar{A}$ is the complement of $A$. The $m$-measure of the set (12) is positive; otherwise $\phi_A(y) = \phi(y)$ almost everywhere, and (10) and (11) would hold with equality for all $\omega$. Since $A$ is convex, there exists an $\omega$ and a $c$ such that the intersection of $\bar{A} \cap B$ and $\{y \mid \omega'y > c\}$ has positive $m$-measure. (Since $A$ is closed, $\bar{A}$ is open and it can be covered with a denumerable collection of open spheres, for example, with rational radii and centers with rational coordinates. Because there is a hyperplane separating $A$ and each sphere, there exists a denumerable collection of open half-spaces $H_j$ disjoint from $A$ that covers $\bar{A}$. Then at least one half-space has an intersection with $\bar{A} \cap B$ with positive $m$-measure.) By hypothesis there exists $\omega_1 \in \Omega$ and an arbitrarily large $\lambda$ such that

$$\omega_\lambda = \omega_1 + \lambda\omega \in \Omega - \Omega_0. \tag{13}$$

Then

$$\int[\phi_A(y) - \phi(y)]\,dP_{\omega_\lambda}(y) = \frac{1}{\psi(\omega_\lambda)}\int[\phi_A(y) - \phi(y)]\,e^{\omega_\lambda'y}\,dm(y)$$
$$= \frac{\psi(\omega_1)}{\psi(\omega_\lambda)}\int[\phi_A(y) - \phi(y)]\,e^{\lambda\omega'y}\,dP_{\omega_1}(y)$$
$$= \frac{\psi(\omega_1)}{\psi(\omega_\lambda)}\,e^{\lambda c}\left\{\int_{\omega'y>c}[\phi_A(y) - \phi(y)]\,e^{\lambda(\omega'y-c)}\,dP_{\omega_1}(y) + \int_{\omega'y\le c}[\phi_A(y) - \phi(y)]\,e^{\lambda(\omega'y-c)}\,dP_{\omega_1}(y)\right\}. \tag{14}$$

For $\omega'y > c$ we have $\phi_A(y) = 1$ and $\phi_A(y) - \phi(y) \ge 0$, and $\{y \mid \phi_A(y) - \phi(y) > 0\}$ has positive measure; therefore, the first integral in the braces approaches $\infty$ as $\lambda \to \infty$. The second integral is bounded because the integrand is bounded by 1, and hence the last expression is positive for sufficiently large $\lambda$. This contradicts (11). ∎


This proof was given by Stein (1956a). It is a generalization of a theorem 
of Birnbaum (1955).


Corollary 5.6.2. If the conditions of Theorem 5.6.5 hold except that A is 
not necessarily closed, but the boundary of A has m-measure 0, then the 
conclusion of Theorem 5.6.5 holds. 


Proof. The closure of $A$ is convex (Problem 5.18), and the test with acceptance region equal to the closure of $A$ differs from $A$ by a set of probability 0 for all $\omega \in \Omega$. Furthermore,

$$A \cap \{y \mid \omega'y > c\} = \varnothing \;\Rightarrow\; A \subset \{y \mid \omega'y \le c\} \;\Rightarrow\; \text{closure } A \subset \{y \mid \omega'y \le c\}. \tag{15}$$

Then Theorem 5.6.5 holds with $A$ replaced by the closure of $A$. ∎


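Theorem 5.6.6. Based on observations $x_1,\ldots,x_N$ from $N(\mu, \Sigma)$, Hotelling's $T^2$-test is admissible for testing the hypothesis $\mu = 0$.


Proof. To apply Theorem 5.6.5 we put the distribution of the observations into the form of an exponential family. By Theorems 3.3.1 and 3.3.2 we can transform $x_1,\ldots,x_N$ to $z_\alpha = \sum_\beta c_{\alpha\beta}x_\beta$, where $(c_{\alpha\beta})$ is orthogonal and $z_N = \sqrt{N}\bar{x}$. Then the density of $z_1,\ldots,z_N$ (with respect to Lebesgue measure) is

$$\frac{1}{(2\pi)^{pN/2}|\Sigma|^{N/2}}\exp\left[-\tfrac12 N\mu'\Sigma^{-1}\mu + \sqrt{N}\mu'\Sigma^{-1}z_N - \tfrac12\operatorname{tr}\Sigma^{-1}\sum_{\alpha=1}^{N}z_\alpha z_\alpha'\right]. \tag{16}$$

The vector $y = (y^{(1)\prime}, y^{(2)\prime})'$ is composed of $y^{(1)} = z_N$ $(= \sqrt{N}\bar{x})$ and $y^{(2)} = (b_{11}, 2b_{12}, \ldots, 2b_{1p}, b_{22}, \ldots, b_{pp})'$, where

$$B = \sum_{\alpha=1}^{N}z_\alpha z_\alpha' \left(= \sum_{\alpha=1}^{N}x_\alpha x_\alpha'\right). \tag{17}$$

The vector $\omega = (\omega^{(1)\prime}, \omega^{(2)\prime})'$ is composed of $\omega^{(1)} = \sqrt{N}\Sigma^{-1}\mu$ and $\omega^{(2)} = -\tfrac12(\sigma^{11}, \sigma^{12}, \ldots, \sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})'$. The measure $m(A)$ is the Lebesgue measure of the set of $z_1,\ldots,z_N$ that maps into the set $A$.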
1 


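Lemma 5.6.1. Let $B = A + N\bar{x}\bar{x}'$. Then

$$N\bar{x}'B^{-1}\bar{x} = \frac{T^2/(N-1)}{1 + T^2/(N-1)}. \tag{18}$$

Proof of Lemma. If we let $B = A + \sqrt{N}\bar{x}\sqrt{N}\bar{x}'$ in (10) of Section 5.2, we obtain by Corollary A.3.1

$$\frac{1}{1 + T^2/(N-1)} = \frac{|A|}{|B|} = \frac{|B - \sqrt{N}\bar{x}\sqrt{N}\bar{x}'|}{|B|} = 1 - N\bar{x}'B^{-1}\bar{x}. \qquad ∎ \tag{19}$$

Thus the acceptance region of a $T^2$-test is

$$A = \{z_N, B \mid z_N'B^{-1}z_N \le k,\ B \text{ positive definite}\} \tag{20}$$

for a suitable $k$.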
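The identity of Lemma 5.6.1 is easy to confirm numerically. The sketch below does so for $p = 2$, where the quadratic form $v'M^{-1}v$ can be written with the explicit $2 \times 2$ adjugate formula; it uses $T^2/(N-1) = N\bar{x}'A^{-1}\bar{x}$, which follows from $S = A/(N-1)$. The helper name `quad_inv_2x2` and the data are assumptions for this illustration.

```python
def quad_inv_2x2(v, M):
    """Return v' M^{-1} v for a symmetric 2 x 2 matrix M via the adjugate formula."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return (M[1][1] * v[0] ** 2 - 2 * M[0][1] * v[0] * v[1] + M[0][0] * v[1] ** 2) / det


# Hypothetical sample: N = 4 observations of dimension p = 2.
xs = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
N = len(xs)
xbar = [sum(x[j] for x in xs) / N for j in range(2)]

# A: sums of squares and cross products about the mean; B = A + N xbar xbar'.
A = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in xs) for j in range(2)]
     for i in range(2)]
B = [[A[i][j] + N * xbar[i] * xbar[j] for j in range(2)] for i in range(2)]

t2_over_n = N * quad_inv_2x2(xbar, A)   # T^2 / (N - 1)
lhs = N * quad_inv_2x2(xbar, B)         # N xbar' B^{-1} xbar
rhs = t2_over_n / (1.0 + t2_over_n)     # right-hand side of the lemma
```

Since the map $t \mapsto t/(1+t)$ is increasing, a region of the form $T^2 \le$ constant is the same as a region of the form $N\bar{x}'B^{-1}\bar{x} \le k$.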

The function $z_N'B^{-1}z_N$ is convex in $(z, B)$ for $B$ positive definite (Problem 5.17). Therefore, the set $z_N'B^{-1}z_N \le k$ is convex. This shows that the set $A$ is convex. Furthermore, the closure of $A$ is convex (Problem 5.18), and the probability of the boundary of $A$ is 0.

Now consider the other condition of Theorem 5.6.5. Suppose $A$ is disjoint with the half-space

$$c < \omega'y = \nu'z_N - \tfrac12\operatorname{tr}\Lambda B, \tag{21}$$

where $\Lambda$ is a symmetric matrix and $B$ is positive semidefinite. We shall take $\Lambda_1 = I$. We want to show that $\omega_1 + \lambda\omega \in \Omega - \Omega_0$; that is, that $\nu_1 + \lambda\nu \ne 0$ (which is trivial) and $\Lambda_1 + \lambda\Lambda$ is positive definite for $\lambda > 0$. This is the case when $\Lambda$ is positive semidefinite. Now we shall show that a half-space (21) disjoint with $A$ and $\Lambda$ not positive semidefinite implies a contradiction. If $\Lambda$ is not positive semidefinite, it can be written (by Corollary A.4.1 of the Appendix)

$$\Lambda = D\begin{pmatrix}I & 0 & 0\\ 0 & -I & 0\\ 0 & 0 & 0\end{pmatrix}D', \tag{22}$$

where $D$ is nonsingular. If $\Lambda$ is not positive semidefinite, $-I$ is not vacuous, because its order is the number of negative characteristic roots of $\Lambda$. Let $z_N = (1/\gamma)z_0$ and

$$B = (D')^{-1}\begin{pmatrix}I & 0 & 0\\ 0 & \gamma I & 0\\ 0 & 0 & I\end{pmatrix}D^{-1}. \tag{23}$$

Then

$$\omega'y = \frac{1}{\gamma}\nu'z_0 + \tfrac12\operatorname{tr}\begin{pmatrix}-I & 0 & 0\\ 0 & \gamma I & 0\\ 0 & 0 & 0\end{pmatrix}, \tag{24}$$

which is greater than $c$ for sufficiently large $\gamma$. On the other hand,

$$z_N'B^{-1}z_N = \frac{1}{\gamma^2}\,z_0'D\begin{pmatrix}I & 0 & 0\\ 0 & (1/\gamma)I & 0\\ 0 & 0 & I\end{pmatrix}D'z_0, \tag{25}$$

which is less than $k$ for sufficiently large $\gamma$. This contradicts the fact that (20) and (21) are disjoint. Thus the conditions of Theorem 5.6.5 are satisfied and the theorem is proved. ∎


This proof is due to Stein. 

An alternative proof of admissibility is to show that the $T^2$-test is a proper Bayes procedure. Suppose an arbitrary random vector $X$ has density $f(x \mid \omega)$ for $\omega \in \Omega$. Consider testing the null hypothesis $H_0: \omega \in \Omega_0$ against the alternative $H_1: \omega \in \Omega - \Omega_0$. Let $\Pi_0$ be a prior finite measure on $\Omega_0$, and $\Pi_1$ a prior finite measure on $\Omega_1$. Then the Bayes procedure (with 0-1 loss function) is to reject $H_0$ if

$$\frac{\int f(x \mid \omega)\,\Pi_1(d\omega)}{\int f(x \mid \omega)\,\Pi_0(d\omega)} \ge c \tag{26}$$

for some $c$ ($0 \le c \le \infty$). If equality in (26) occurs with probability 0 for all $\omega \in \Omega_0$, then the Bayes procedure is unique and hence admissible. Since the measures are finite, they can be normed to be probability measures. For the $T^2$-test of $H_0: \mu = 0$ a pair of measures is suggested in Problem 5.15. (This pair is not unique.) The reader can verify that with these measures (26) reduces to the complement of (20).

Among invariant tests it was shown that the $T^2$-test is uniformly most powerful; that is, it is most powerful against every value of $\mu'\Sigma^{-1}\mu$ among invariant tests of the specified significance level. We can ask whether the $T^2$-test is "best" against a specified value of $\mu'\Sigma^{-1}\mu$ among all tests. Here "best" can be taken to mean admissible minimax; and "minimax" means maximizing with respect to procedures the minimum with respect to parameter values of the power. This property was shown in the simplest case of $p = 2$ and $N = 3$ by Giri, Kiefer, and Stein (1963). The property for general $p$ and $N$ was announced by Salaevskii (1968). He has furnished a proof for the case of $p = 2$ [Salaevskii (1971)], but has not given a proof for $p > 2$.

Giri and Kiefer (1964) have proved the $T^2$-test is locally minimax (as $\mu'\Sigma^{-1}\mu \to 0$) and asymptotically (logarithmically) minimax as $\mu'\Sigma^{-1}\mu \to \infty$.


5.7. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


5.7.1. Observations Elliptically Contoured 


When $x_1,\ldots,x_N$ constitute a sample of $N$ from

$$|\Lambda|^{-1/2}\,g[(x-\nu)'\Lambda^{-1}(x-\nu)], \tag{1}$$

the sample mean $\bar{x}$ and covariance $S$ are unbiased estimators of the distribution mean $\mu = \nu$ and covariance matrix $\Sigma = (\mathscr{E}R^2/p)\Lambda$, where $R^2 = (X-\nu)'\Lambda^{-1}(X-\nu)$ has finite expectation. The $T^2$-statistic, $T^2 = N(\bar{x}-\mu)'S^{-1}(\bar{x}-\mu)$, can be used for tests and confidence regions for $\mu$ when $\Sigma$ (or $\Lambda$) is unknown, but the small-sample distribution of $T^2$ in general is difficult to obtain. However, the limiting distribution of $T^2$ when $N \to \infty$ is obtained from the facts that $\sqrt{N}(\bar{x}-\mu) \xrightarrow{d} N(0, \Sigma)$ and $S \xrightarrow{p} \Sigma$ (Theorem 3.6.2).

Theorem 5.7.1. Let $x_1,\ldots,x_N$ be a sample from (1). Assume $\mathscr{E}R^2 < \infty$. Then $T^2 \xrightarrow{d} \chi_p^2$.


Proof. Theorem 3.6.2 implies that $N(\bar{x}-\mu)'\Sigma^{-1}(\bar{x}-\mu) \xrightarrow{d} \chi_p^2$ and $N(\bar{x}-\mu)'\Sigma^{-1}(\bar{x}-\mu) - T^2 \xrightarrow{p} 0$. ∎


Theorem 5.7.1 implies that the procedures in Section 5.3 can be done on an asymptotic basis for elliptically contoured distributions. For example, to test the null hypothesis $\mu = \mu_0$, reject the null hypothesis if

$$N(\bar{x}-\mu_0)'S^{-1}(\bar{x}-\mu_0) > \chi_p^2(\alpha), \tag{2}$$

where $\chi_p^2(\alpha)$ is the $\alpha$-significance point of the $\chi^2$-distribution with $p$ degrees of freedom; the limiting probability of (2) when the null hypothesis is true and $N \to \infty$ is $\alpha$. Similarly the confidence region $N(\bar{x}-m)'S^{-1}(\bar{x}-m) \le \chi_p^2(\alpha)$ has limiting confidence $1-\alpha$.
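The asymptotic test can be sketched for the special case $p = 2$, where the $\chi^2$ significance point has a closed form: the upper tail probability of $\chi_2^2$ is $e^{-x/2}$, so $\chi_2^2(\alpha) = -2\log\alpha$. The function names and data below are hypothetical, and the sample is far too small for the limiting distribution to be a trustworthy approximation; it only illustrates the mechanics.

```python
import math


def chi2_upper_point_2df(alpha):
    """Upper alpha point of chi-square with 2 d.f.; its survival function is exp(-x/2)."""
    return -2.0 * math.log(alpha)


def t2_stat_2d(xs, mu0):
    """T^2 = N (xbar - mu0)' S^{-1} (xbar - mu0) for bivariate observations."""
    N = len(xs)
    xbar = [sum(x[j] for x in xs) / N for j in range(2)]
    S = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in xs) / (N - 1)
          for j in range(2)] for i in range(2)]
    d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    # Quadratic form d' S^{-1} d written out with the 2 x 2 adjugate.
    quad = (S[1][1] * d[0] ** 2 - 2 * S[0][1] * d[0] * d[1] + S[0][0] * d[1] ** 2) / det
    return N * quad


# Toy data; in practice N must be large for the chi-square approximation.
xs = [[1.0, 1.0], [0.0, -1.0], [2.0, 2.0], [3.0, 3.0]]
t2 = t2_stat_2d(xs, mu0=[0.0, 0.0])
critical = chi2_upper_point_2df(0.05)
reject = t2 > critical
```

For other values of $p$ the significance point would have to come from tables or a statistical library.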


5.7.2. Elliptically Contoured Matrix Distributions 
Let $X$ ($N \times p$) have the density

$$|C|^{-N}\,g[C^{-1}(X-\varepsilon_N\nu')'(X-\varepsilon_N\nu')(C')^{-1}] \tag{3}$$

based on the left spherical density $g(Y'Y)$. Here $Y$ has the representation $Y \stackrel{d}{=} UR'$, where $U$ ($N \times p$) has the uniform distribution on $O(N \times p)$, $R$ is lower triangular, and $U$ and $R$ are independent. Then $X \stackrel{d}{=} \varepsilon_N\nu' + UR'C'$. The $T^2$-criterion to test the hypothesis $\nu = 0$ is $N\bar{x}'S^{-1}\bar{x}$, which is invariant with respect to transformations $X \to XG$. By Corollary 4.5.5 we obtain the following theorem.


Theorem 5.7.2. Suppose $X$ has the density (3) with $\nu = 0$ and $T^2 = N\bar{x}'S^{-1}\bar{x}$. Then $[T^2/(N-1)][(N-p)/p]$ has the distribution of $F_{p,N-p} = (\chi_p^2/p)/[\chi_{N-p}^2/(N-p)]$.


Thus the tests of hypotheses and construction of confidence regions at 
stated significance and confidence levels are valid for left spherical distribu- 
tions. 

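The $T^2$-criterion for $H: \nu = 0$ is

$$T^2 = N\bar{x}'S^{-1}\bar{x} \stackrel{d}{=} N\bar{u}'S_U^{-1}\bar{u}, \tag{4}$$

since $X \stackrel{d}{=} UR'C'$,

$$\bar{x}' = \frac{1}{N}\varepsilon_N'X \stackrel{d}{=} \frac{1}{N}\varepsilon_N'UR'C' = \bar{u}'(CR)', \tag{5}$$

and

$$S = \frac{1}{N-1}(X'X - N\bar{x}\bar{x}') \stackrel{d}{=} \frac{1}{N-1}[CRU'UR'C' - N\,CR\bar{u}\bar{u}'(CR)'] = CRS_U(CR)'. \tag{6}$$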

5.7.3. Linear Combinations 


Läuter, Glimm, and Kropf (1996a, 1996b, 1996c) have observed that a statistician can use $X'X \stackrel{d}{=} CRR'C'$ when $\nu = 0$ to determine a $p \times q$ matrix $D$ and base a $T^2$-test on the transform $Z = XD$. Specifically, define

$$\bar{z}' = \frac{1}{N}\varepsilon_N'Z = \bar{x}'D, \tag{7}$$

$$S_Z = \frac{1}{N-1}(Z'Z - N\bar{z}\bar{z}') = D'SD, \tag{8}$$

$$T_D^2 = N\bar{z}'S_Z^{-1}\bar{z}. \tag{9}$$

Since $Q_NZ = Q_NUR'C'D \stackrel{d}{=} UR'C'D = Z$, the matrix $Z$ is based on the left-spherical $YD$ and hence has the representation $Z = VR^{*\prime}$, where $V$ ($N \times q$) has the uniform distribution on $O(N \times q)$, independent of $R^{*\prime}$ (upper triangular) having the distribution derived from $R^*R^{*\prime} = Z'Z$. The distribution of $T_D^2/(N-1)$ is $F_{q,N-q}\,q/(N-q)$.

The matrix $D$ can also involve prior information as well as knowledge of $X'X$. If $p$ is large, $q$ can be small; the test based on $T_D^2$ may be more powerful than a test based on $T^2$.

Läuter, Glimm, and Kropf give several examples of choosing $D$. One of them is to choose $D$ ($p \times 1$) as $[\operatorname{Diag}(X'X)]^{-1/2}\varepsilon_p$, where $\operatorname{Diag}A$ is a diagonal matrix with $i$th diagonal element $a_{ii}$. The statistic $T_D^2$ is called the standardized sum statistic.
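The standardized sum statistic is straightforward to compute. In the sketch below each column of $X$ is divided by the square root of its uncentered sum of squares (the diagonal of $X'X$), the scaled columns are summed to give a single score per observation, and the univariate $T^2$ of those scores is returned. With $q = 1$, the stated distribution of $T_D^2/(N-1)$ means that $T_D^2$ is referred to $F_{1,N-1}$ under the null hypothesis. The function name and data are assumptions made for the illustration.

```python
import math


def standardized_sum_t2(X):
    """T_D^2 for D = Diag(X'X)^{-1/2} eps_p, so that q = 1."""
    N, p = len(X), len(X[0])
    # d_j = 1 / sqrt(j-th diagonal element of X'X), an uncentered column norm.
    d = [1.0 / math.sqrt(sum(X[a][j] ** 2 for a in range(N))) for j in range(p)]
    z = [sum(X[a][j] * d[j] for j in range(p)) for a in range(N)]  # Z = X D
    zbar = sum(z) / N
    s2 = sum((za - zbar) ** 2 for za in z) / (N - 1)               # S_Z, a scalar here
    return N * zbar ** 2 / s2


# Hypothetical data: N = 4 observations on p = 3 variables.
X = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 5.0, 1.0], [4.0, 3.0, 2.0]]
t2_d = standardized_sum_t2(X)

# Rescaling a variable leaves the statistic unchanged, because D absorbs the scale.
X_rescaled = [[10.0 * row[0], row[1], row[2]] for row in X]
t2_d_rescaled = standardized_sum_t2(X_rescaled)
```

The invariance to the units of each variable is the reason for standardizing before summing.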


PROBLEMS 


5.1. (Sec. 5.2) Let $x_\alpha$ be distributed according to $N(\mu + \beta(z_\alpha-\bar{z}), \Sigma)$, $\alpha = 1,\ldots,N$, where $\bar{z} = (1/N)\sum z_\alpha$. Let $b = [1/\sum(z_\alpha-\bar{z})^2]\sum x_\alpha(z_\alpha-\bar{z})$, $(N-2)S = \sum[x_\alpha-\bar{x}-b(z_\alpha-\bar{z})][x_\alpha-\bar{x}-b(z_\alpha-\bar{z})]'$, and $T^2 = \sum(z_\alpha-\bar{z})^2\,b'S^{-1}b$. Show that $T^2$ has the $T^2$-distribution with $N-2$ degrees of freedom. [Hint: See Problem 3.13.]


5.2. (Sec. 5.2.2) Show that $T^2/(N-1)$ can be written as $R^2/(1-R^2)$ with the correspondences given in Table 5.1.

Table 5.1

Section 5.2                                        Section 4.4
$z_{0\alpha} = 1/\sqrt{N}$                         $z_{1\alpha}$
$x_\alpha$                                         $z_\alpha^{(2)}$
$\sqrt{N}\bar{x} = \sum z_{0\alpha}x_\alpha$       $a_{(1)} = \sum z_{1\alpha}z_\alpha^{(2)}$
$B = \sum x_\alpha x_\alpha'$                      $A_{22} = \sum z_\alpha^{(2)}z_\alpha^{(2)\prime}$
$1 = \sum z_{0\alpha}^2$                           $a_{11} = \sum z_{1\alpha}^2$
$T^2/(N-1)$                                        $R^2/(1-R^2)$
$p$                                                $p-1$
$N$                                                $n$

5.3. (Sec. 5.2.2) Let

$$\frac{R^2}{1-R^2} = \frac{\sum u_\alpha x_\alpha'\left(\sum x_\alpha x_\alpha'\right)^{-1}\sum u_\alpha x_\alpha}{\sum u_\alpha^2 - \sum u_\alpha x_\alpha'\left(\sum x_\alpha x_\alpha'\right)^{-1}\sum u_\alpha x_\alpha},$$

where $u_1,\ldots,u_N$ are $N$ numbers and $x_1,\ldots,x_N$ are independent, each with the distribution $N(0, \Sigma)$. Prove that the distribution of $R^2/(1-R^2)$ is independent of $u_1,\ldots,u_N$. [Hint: There is an orthogonal $N \times N$ matrix $C$ that carries $(u_1,\ldots,u_N)$ into a vector proportional to $(1/\sqrt{N},\ldots,1/\sqrt{N})$.]


5.4. (Sec. 5.2.2) Use Problems 5.2 and 5.3 to show that $[T^2/(N-1)][(N-p)/p]$ has the $F_{p,N-p}$-distribution (under the null hypothesis). [Note: This is the analysis that corresponds to Hotelling's geometric proof (1931).]


5.5. (Sec. 5.2.2) Let $T^2 = N\bar{x}'S^{-1}\bar{x}$, where $\bar{x}$ and $S$ are the mean vector and covariance matrix of a sample of $N$ from $N(\mu, \Sigma)$. Show that $T^2$ is distributed the same when $\mu$ is replaced by $\lambda = (\tau, 0, \ldots, 0)'$, where $\tau^2 = \mu'\Sigma^{-1}\mu$, and $\Sigma$ is replaced by $I$.


5.6. (Sec. 5.2.2) Let $u = [T^2/(N-1)]/[1 + T^2/(N-1)]$. Show that $u = \gamma V'(VV')^{-1}V\gamma'$, where $\gamma = (1/\sqrt{N},\ldots,1/\sqrt{N})$ and

$$V = \begin{pmatrix}v_1\\ \vdots\\ v_p\end{pmatrix} = \begin{pmatrix}x_{11} & \cdots & x_{1N}\\ \vdots & & \vdots\\ x_{p1} & \cdots & x_{pN}\end{pmatrix}.$$

5.7. (Sec. 5.2.2) Let
vř =v], 
жор Vii _ 1 р А 
v? =p, sivi m vi ри , і%+ 1, 
к A) 
Y Y vivi 1; 
vi 
V* = : 
* 
vp 


vivi Ут] 
*Y fk yee eo \ Tt ж 
v2 v2v2 v2 Vp v3 
1 
= * . 
Ww = ЕЕ *! 
yey Й . Y 
, жж, * 
Vp v, V2 v, Vp L^ 
Hint: EV = V*, where 
1 0 
r 
_ v2" 1 
vw) 
E= 

. О 

уу 

p.i 

-—z 0 - 1 
yyy 


5.8. (Sec. 5.2.2) Prove that w has the distribution of the square of a multiple 


correlation between one vector and р-1 vectors in (№ — 1)-space without 
subtracting means; that is, it has density 


o TROND]  .u»- val 
iWin)” (aw), 


[ Hint: The transformation of Problem 5.7 is a projection of v,,...,v,, y on the 
— р P 
(N — 1)-space orthogonal to v,.] 


5.9. (Sec. 5.2.2) Verify that r=s/(1 —s) multiplied by (N — 1)/1 has the noncen- 


tral F-distribution with 1 and N —1 degrees of freedom and noncentrality 
parameter №2, 


5.10. (Sec. 5.2.2) From Problems 5.5-5.9, verify Corollary 5.2.1.


5.11. (Sec. 5.3) Use the data in Section 3.2 to test the hypothesis that neither drug
has a soporific effect at significance level 0.01.


5.12. (Sec. 5.3) Using the data in Section 3.2, give a confidence region for $\mu$ with
confidence coefficient 0.95.


5.13. (Sec. 5.3) Prove the statement in Section 5.3.6 that the $T^2$-statistic is indepen-
dent of the choice of $C$.


5.14. (Sec. 5.5) Use the data of Problem 4.41 to test the hypothesis that the mean
head length and breadth of first sons are equal to those of second sons at
significance level 0.01.


5.15. (Sec. 5.6.2) $T^2$-test as a Bayes procedure [Kiefer and Schwartz (1965)]. Let
$x_1, \ldots, x_N$ be independently distributed, each according to $N(\mu, \Sigma)$. Let $\Pi_0$ be
defined by $[\mu, \Sigma] = [0, (I + \eta\eta')^{-1}]$ with $\eta$ having a density proportional to
$|I + \eta\eta'|^{-N/2}$, and let $\Pi_1$ be defined by $[\mu, \Sigma] = [(I + \eta\eta')^{-1}\eta, (I + \eta\eta')^{-1}]$
with $\eta$ having a density proportional to

$|I + \eta\eta'|^{-N/2} \exp[\tfrac{1}{2} N \eta'(I + \eta\eta')^{-1}\eta].$

(a) Show that the measures are finite for $N > p$ by showing $\eta'(I + \eta\eta')^{-1}\eta \le 1$
and verifying that the integral of $|I + \eta\eta'|^{-N/2} = (1 + \eta'\eta)^{-N/2}$ is finite.
(b) Show that the inequality (26) is equivalent to $N\bar{x}'(\sum_{\alpha=1}^{N} x_\alpha x_\alpha')^{-1}\bar{x} \ge k$.

Hence the $T^2$-test is Bayes and thus admissible.


5.16. (Sec. 5.6.2) Let $g(t) = f[ty_1 + (1 - t)y_2]$, where $f(y)$ is a real-valued function
of the vector $y$. Prove that if $g(t)$ is convex, then $f(y)$ is convex.


5.17. (Sec. 5.6.2) Show that $z'B^{-1}z$ is a convex function of $(z, B)$, where $B$ is a
positive definite matrix. [Hint: Use Problem 5.16.]


5.18. (Sec. 5.6.2) Prove that if the set $A$ is convex, then the closure of $A$ is convex.


5.19. (Sec. 5.3) Let $\bar{x}$ and $S$ be based on $N$ observations from $N(\mu, \Sigma)$, and let $x$
be an additional observation from $N(\mu, \Sigma)$. Show that $x - \bar{x}$ is distributed
according to

$N[0, (1 + 1/N)\Sigma].$

Verify that $[N/(N + 1)](x - \bar{x})'S^{-1}(x - \bar{x})$ has the $T^2$-distribution with $N - 1$
degrees of freedom. Show how this statistic can be used to give a prediction
region for $x$ based on $\bar{x}$ and $S$ (i.e., a region such that one has a given
confidence that the next observation will fall into it).
5.20. (Sec. 5.3) Let $x_\alpha^{(i)}$ be observations from $N(\mu^{(i)}, \Sigma_i)$, $\alpha = 1, \ldots, N_i$, $i = 1, 2$. Find
the likelihood ratio criterion for testing the hypothesis $\mu^{(1)} = \mu^{(2)}$.


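5.21. (Sec. 5.4) Prove that $\mu'\Sigma^{-1}\mu$ is larger for $\mu' = (\mu_1, \mu_2)$ than for $\mu = \mu_1$ by
verifying

$\frac{1}{1-\rho^2}\left(\frac{\mu_1^2}{\sigma_1^2} - 2\rho\frac{\mu_1\mu_2}{\sigma_1\sigma_2} + \frac{\mu_2^2}{\sigma_2^2}\right) = \frac{\mu_1^2}{\sigma_1^2} + \frac{(\mu_2 - \rho\sigma_2\mu_1/\sigma_1)^2}{(1-\rho^2)\sigma_2^2}.$

Discuss the power of the test $\mu_1 = 0$ compared to the power of the test $\mu_1 = 0$,
$\mu_2 = 0$.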


5.22. (Sec. 5.3)

(a) Using the data of Section 5.3.4, test the hypothesis $\mu_1^{(1)} = \mu_1^{(2)}$.

(b) Test the hypothesis $\mu_1^{(1)} = \mu_1^{(2)}$, $\mu_2^{(1)} = \mu_2^{(2)}$.


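5.23. (Sec. 5.4) Let

$\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

Prove $\mu'\Sigma^{-1}\mu \ge \mu^{(1)\prime}\Sigma_{11}^{-1}\mu^{(1)}$. Give a condition for strict inequality to hold.
[Hint: This is the vector analog of Problem 5.21.]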


5.24. Let $X^{(i)\prime} = (Y^{(i)\prime}, Z^{(i)\prime})$, $i = 1, 2$, where $Y^{(i)}$ has $p$ components and $Z^{(i)}$ has $q$
components, be distributed according to $N(\mu^{(i)}, \Sigma)$, where


- А uo - n. 
Find the likelihood ratio criterion (or equivalent T criterion) for testing p? = 


uC given pi? =p on the basis of a sample of № on XO, i= 1,2. [Hint 


Express the likelihood in terms of the marginal density of $Y^{(i)}$ and the
conditional density of $Z^{(i)}$ given $Y^{(i)}$.]


5.25. Find the distribution of the criterion in the preceding problem under the null 
hypothesis. 


5.26. (Sec. 5.5) Suppose $x_\alpha^{(g)}$ is an observation from $N(\mu^{(g)}, \Sigma_g)$, $\alpha = 1, \ldots, N_g$,
$g = 1, \ldots, q$.
(a) Show that the hypothesis $\mu^{(1)} = \cdots = \mu^{(q)}$ is equivalent to $\mathscr{E}y_\alpha^{(i)} = 0$,
$i = 1, \ldots, q - 1$, where
1 N N; 
v of M is 1 QD 
U) = gie + Zn xa — У, x + i x |, 
Ya 78 L SN, № B-1 (MN)! 81 
а=1,....М, #=1.....9-1; 
ММ. 8=2,..., 4; and (af... a), 1 1,...,q — 1, are linearly inde- 
pendent. s PIN 
(b) Show how to construct a $T^2$-test of the hypothesis using $(y^{(1)\prime}, \ldots, y^{(q-1)\prime})'$
yielding an F-statistic with $(q - 1)p$ and $N - (q - 1)p$ degrees of freedom
[Anderson (1963b)].


5.27. (Sec. 5.2) Prove (25) is the density of $V = \chi_a^2/(\chi_a^2 + \chi_b^2)$. [Hint: In the joint
density of $U = \chi_a^2$ and $W = \chi_b^2$ make the transformation $u = vw(1 - v)^{-1}$, $w = w$
and integrate out $w$.]


CHAPTER 6 


Classification of Observations 


6.1. THE PROBLEM OF CLASSIFICATION 


The problem of classification arises when an investigator makes a number of 
measurements on an individual and wishes to classify the individual into one 
of several categories on the basis of these measurements. The investigator 
cannot identify the individual with a category directly but must use these 
measurements. In many cases it can be assumed that there are a finite num- 
ber of categories or populations from which the individual may have come and 
each population is characterized by a probability distribution of the measure- 
ments. Thus an individual is considered as a random observation from this 
population. The question is: Given an individual with certain measurements, 
from which population did the person arise? 

The problem of classification may be considered as a problem of “statisti- 
cal decision functions." We have a number of hypotheses: Each hypothesis is 
that the distribution of the observation is a given one. We must accept one of 
these hypotheses and reject the others. If only two populations are admitted,
we have an elementary problem of testing one hypothesis of a specified 
distribution against another. 

In some instances, the categories are specified beforehand in the sense 
that the probability distributions of the measurements are assumed com- 
pletely known. In other cases, the form of each distribution may be known, 
but the parameters of the distribution must be estimated from a sample from 
that population. 

Let us give an example of a problem of classification. Prospective students 
applying for admission into college are given a battery of tests; the vector of
scores is a set of measurements $x$. The prospective student may be a member
of one population consisting of those students who will successfully complete 
college training or, rather, have potentialities for successfully completing 
training, or the student may be a member of the other population, those who 
will not complete the college course successfully. The problem is to classify a 
student applying for admission on the basis of his scores on the entrance 
examination. 

In this chapter we shall develop the theory of classification in general 
terms and then apply it to cases involving the normal distribution. In Section 
6.2 the problem of classification with two populations is defined in terms of 
decision theory, and in Section 6.3 Bayes and admissible solutions are 
obtained. In Section 6.4 the theory is applied to two known normal popula- 
tions, differing with respect to means, yielding the population linear dis- 
criminant function. When the parameters are unknown, they are replaced by 
estimates (Section 6.5). An alternative procedure is maximum likelihood. In 
Section 6.6 the probabilities of misclassification by the two methods are evalu- 
ated in terms of asymptotic expansions of the distributions. Then these devel- 
opments are carried out for several populations. Finally, in Section 6.10 linear 
procedures for the two populations are studied when the covariance matrices 
are different and the parameters are known. 


6.2. STANDARDS OF GOOD CLASSIFICATION 


6.2.1. Preliminary Considerations 


In constructing a procedure of classification, it is desired to minimize the 
probability of misclassification, or, more specifically, it is desired to minimize 
on the average the bad effects of misclassification. Now let us make this 
notion precise. For convenience we shall now consider the case of only two 
categories. Later we shall treat the more general case. This section develops 
the ideas of Section 3.4 in more detail for the problem of two decisions. 

Suppose an individual is an observation from either population $\pi_1$ or
population $\pi_2$. The classification of an observation depends on the vector of
measurements $x' = (x_1, \ldots, x_p)$ on that individual. We set up a rule that if an
individual is characterized by certain sets of values of $x_1, \ldots, x_p$ that person
will be classified as from $\pi_1$, if other values, as from $\pi_2$.

We can think of an observation as a point in a $p$-dimensional space. We
divide this space into two regions. If the observation falls in $R_1$, we classify it
as coming from population $\pi_1$, and if it falls in $R_2$ we classify it as coming
from population $\pi_2$.

In following a given classification procedure, the statistician can make two
kinds of errors in classification. If the individual is actually from $\pi_1$, the
statistician can classify him or her as coming from population $\pi_2$; if from $\pi_2$,
the statistician can classify him or her as from $\pi_1$. We need to know the
relative undesirability of these two kinds of misclassification. Let the cost of
the first type of misclassification be $C(2 \mid 1)$ $(> 0)$, and let the cost of
misclassifying an individual from $\pi_2$ as from $\pi_1$ be $C(1 \mid 2)$ $(> 0)$. These costs
may be measured in any kind of units. As we shall see later, it is only the
ratio of the two costs that is important. The statistician may not know these
costs in each case, but will often have at least a rough idea of them.

Table 6.1

                          Statistician's Decision
                          $\pi_1$           $\pi_2$
Population   $\pi_1$      0                 $C(2 \mid 1)$
             $\pi_2$      $C(1 \mid 2)$     0

Table 6.1 indicates the costs of correct and incorrect classification. Clearly,
a good classification procedure is one that minimizes in some sense or other
the cost of misclassification.


6.2.2. Two Cases of Two Populations 


We shall consider ways of defining "minimum cost" in two cases. In one case
we shall suppose that we have a priori probabilities of the two populations.
Let the probability that an observation comes from population $\pi_1$ be $q_1$ and
from population $\pi_2$ be $q_2$ $(q_1 + q_2 = 1)$. The probability properties of popu-
lation $\pi_i$ are specified by a distribution function. For convenience we shall
treat only the case where the distribution has a density, although the case of
discrete probabilities lends itself to almost the same treatment. Let the
density of population $\pi_1$ be $p_1(x)$ and that of $\pi_2$ be $p_2(x)$. If we have a
region $R_1$ of classification as from $\pi_1$, the probability of correctly classifying
an observation that actually is drawn from population $\pi_1$ is

(1)  $P(1 \mid 1, R) = \int_{R_1} p_1(x)\,dx,$

where $dx = dx_1 \cdots dx_p$, and the probability of misclassification of an observa-
tion from $\pi_1$ is

(2)  $P(2 \mid 1, R) = \int_{R_2} p_1(x)\,dx.$

Similarly, the probability of correctly classifying an observation from $\pi_2$ is

(3)  $P(2 \mid 2, R) = \int_{R_2} p_2(x)\,dx,$

and the probability of misclassifying such an observation is

(4)  $P(1 \mid 2, R) = \int_{R_1} p_2(x)\,dx.$


Since the probability of drawing an observation from $\pi_1$ is $q_1$, the
probability of drawing an observation from $\pi_1$ and correctly classifying it is
$q_1 P(1 \mid 1, R)$; that is, this is the probability of the situation in the upper
left-hand corner of Table 6.1. Similarly, the probability of drawing an
observation from $\pi_1$ and misclassifying it is $q_1 P(2 \mid 1, R)$. The probability
associated with the lower left-hand corner of Table 6.1 is $q_2 P(1 \mid 2, R)$, and
with the lower right-hand corner is $q_2 P(2 \mid 2, R)$.

What is the average or expected loss from costs of misclassification? It is
the sum of the products of costs of misclassifications with their respective
probabilities of occurrence:

(5)  $C(2 \mid 1) P(2 \mid 1, R)\, q_1 + C(1 \mid 2) P(1 \mid 2, R)\, q_2.$

It is this average loss that we wish to minimize. That is, we want to divide our
space into regions $R_1$ and $R_2$ such that the expected loss is as small as
possible. A procedure that minimizes (5) for given $q_1$ and $q_2$ is called a Bayes
procedure.
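
As a small numerical illustration of the expected loss (5), the following Python sketch evaluates it for two hypothetical procedures. The function name `expected_loss` and all of the numbers are assumptions chosen only for illustration; they do not come from the text.

```python
def expected_loss(q1, q2, c21, c12, p21, p12):
    """Expected loss (5): C(2|1) P(2|1,R) q1 + C(1|2) P(1|2,R) q2."""
    return c21 * p21 * q1 + c12 * p12 * q2


# Hypothetical priors, costs, and misclassification probabilities.
q1, q2 = 0.7, 0.3          # a priori probabilities, q1 + q2 = 1
c21, c12 = 1.0, 5.0        # C(2|1) and C(1|2)

# Procedure A misclassifies pi_2 more often than procedure B does.
loss_a = expected_loss(q1, q2, c21, c12, p21=0.10, p12=0.20)
loss_b = expected_loss(q1, q2, c21, c12, p21=0.25, p12=0.05)

print(loss_a, loss_b)
```

With these assumed values procedure B has the smaller expected loss even though it misclassifies members of $\pi_1$ more often, because the costlier error $C(1 \mid 2)$ is made less frequently.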

In the example of admission of students, the undesirability of misclassifica- 
tion is, in one instance, the expense of teaching a student who will not
complete the course successfully and is, in the other instance, the undesirabil- 
ity of excluding from college a potentially good student. 

The other case we shall treat is that in which there are no known a priori
probabilities. In this case the expected loss if the observation is from $\pi_1$ is

(6)  $C(2 \mid 1) P(2 \mid 1, R) = r(1, R);$

the expected loss if the observation is from $\pi_2$ is

(7)  $C(1 \mid 2) P(1 \mid 2, R) = r(2, R).$

We do not know whether the observation is from $\pi_1$ or from $\pi_2$, and we do
not know probabilities of these two instances.

A procedure $R$ is at least as good as a procedure $R^*$ if $r(1, R) \le r(1, R^*)$
and $r(2, R) \le r(2, R^*)$; $R$ is better than $R^*$ if at least one of these inequalities
is a strict inequality. Usually there is no one procedure that is better than all
other procedures or is at least as good as all other procedures. A procedure
$R$ is called admissible if there is no procedure better than $R$; we shall be
interested in the entire class of admissible procedures. It will be shown that
under certain conditions this class is the same as the class of Bayes proce-
dures. A class of procedures is complete if for every procedure outside the
class there is one in the class which is better; a class is called essentially 
complete if for every procedure outside the class there is one in the class 
which is at least as good. A minimal complete class (if it exists) is a complete 
class such that no proper subset is a complete class; a similar definition holds 
for a minimal essentially complete class. Under certain conditions we shall 
show that the admissible class is minimal complete. To simplify the discussion 
we shall consider procedures the same if they only differ on sets of probabil-
ity zero. In fact, throughout the next section we shall make statements which
are meant to hold except for sets of probability zero without saying so explicitly. 

A principle that usually leads to a unique procedure is the minimax 
principle. A procedure is minimax if the maximum expected loss, $r(i, R)$, is a
minimum. From a conservative point of view, this may be considered an 
optimum procedure. For a general discussion of the concepts in this section 
and the next see Wald (1950), Blackwell and Girshick (1954), Ferguson 
(1967), DeGroot (1970), and Berger (1980b). 


6.3. PROCEDURES OF CLASSIFICATION INTO ONE OF TWO 
POPULATIONS WITH KNOWN PROBABILITY DISTRIBUTIONS 


6.3.1. The Case When A Priori Probabilities Are Known 


We now turn to the problem of choosing regions $R_1$ and $R_2$ so as to mini-
mize (5) of Section 6.2. Since we have a priori probabilities, we can define joint
probabilities of the population and the observed set of variables. The prob-
ability that an observation comes from $\pi_1$ and that each variate is less than
the corresponding component in $y$ is

(1)  $\int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_p} q_1 p_1(x)\,dx_1 \cdots dx_p.$

We can also define the conditional probability that an observation came from
a certain population given the values of the observed variates. For instance,
the conditional probability of coming from population $\pi_1$, given an observa-
tion $x$, is

(2)  $\frac{q_1 p_1(x)}{q_1 p_1(x) + q_2 p_2(x)}.$

Suppose for a moment that $C(1 \mid 2) = C(2 \mid 1) = 1$. Then the expected loss is

(3)  $q_1 \int_{R_2} p_1(x)\,dx + q_2 \int_{R_1} p_2(x)\,dx.$
This is also the probability of a misclassification; hence we wish to minimize
the probability of misclassification.
For a given observed point $x$ we minimize the probability of a misclassifi-
cation by assigning the population that has the higher conditional probability.
If

(4)  $\frac{q_1 p_1(x)}{q_1 p_1(x) + q_2 p_2(x)} \ge \frac{q_2 p_2(x)}{q_1 p_1(x) + q_2 p_2(x)},$

we choose population $\pi_1$. Otherwise we choose population $\pi_2$. Since we
minimize the probability of misclassification at each point, we minimize it
over the whole space. Thus the rule is

(5)  $R_1$: $q_1 p_1(x) \ge q_2 p_2(x)$,
     $R_2$: $q_1 p_1(x) < q_2 p_2(x)$.

If $q_1 p_1(x) = q_2 p_2(x)$, the point could be classified as either from $\pi_1$ or $\pi_2$;
we have arbitrarily put it into $R_1$. If $q_1 p_1(x) + q_2 p_2(x) = 0$ for a given $x$, that
point also may go into either region.

Now let us prove formally that (5) is the best procedure. For any proce-
dure $R^* = (R_1^*, R_2^*)$, the probability of misclassification is

(6)  $q_1 \int_{R_2^*} p_1(x)\,dx + q_2 \int_{R_1^*} p_2(x)\,dx$
     $= \int_{R_2^*} [q_1 p_1(x) - q_2 p_2(x)]\,dx + q_2 \int p_2(x)\,dx.$

On the right-hand side the second term is a given number; the first term is
minimized if $R_2^*$ includes the points $x$ such that $q_1 p_1(x) - q_2 p_2(x) < 0$ and
excludes the points for which $q_1 p_1(x) - q_2 p_2(x) > 0$. If

(7)  $\Pr\left\{ \frac{p_1(x)}{p_2(x)} = \frac{q_2}{q_1} \,\Big|\, \pi_i \right\} = 0, \qquad i = 1, 2,$

then the Bayes procedure is unique except for sets of probability zero.

Now we notice that mathematically the problem was: given nonnegative
constants $q_1$ and $q_2$ and nonnegative functions $p_1(x)$ and $p_2(x)$, choose
regions $R_1$ and $R_2$ so as to minimize (3). The solution is (5). If we wish to
minimize (5) of Section 6.2, which can be written

(8)  $[C(2 \mid 1) q_1] \int_{R_2} p_1(x)\,dx + [C(1 \mid 2) q_2] \int_{R_1} p_2(x)\,dx,$

we choose $R_1$ and $R_2$ according to

(9)  $R_1$: $[C(2 \mid 1) q_1]\, p_1(x) \ge [C(1 \mid 2) q_2]\, p_2(x)$,
     $R_2$: $[C(2 \mid 1) q_1]\, p_1(x) < [C(1 \mid 2) q_2]\, p_2(x)$,

since $C(2 \mid 1) q_1$ and $C(1 \mid 2) q_2$ are nonnegative constants. Another way of
writing (9) is

(10)  $R_1$: $\frac{p_1(x)}{p_2(x)} \ge \frac{C(1 \mid 2)\, q_2}{C(2 \mid 1)\, q_1}$,
      $R_2$: $\frac{p_1(x)}{p_2(x)} < \frac{C(1 \mid 2)\, q_2}{C(2 \mid 1)\, q_1}$.


Theorem 6.3.1. If $q_1$ and $q_2$ are a priori probabilities of drawing an
observation from population $\pi_1$ with density $p_1(x)$ and $\pi_2$ with density $p_2(x)$,
respectively, and if the cost of misclassifying an observation from $\pi_1$ as from $\pi_2$
is $C(2 \mid 1)$ and an observation from $\pi_2$ as from $\pi_1$ is $C(1 \mid 2)$, then the regions of
classification $R_1$ and $R_2$, defined by (10), minimize the expected cost. If

(11)  $\Pr\left\{ \frac{p_1(x)}{p_2(x)} = \frac{q_2 C(1 \mid 2)}{q_1 C(2 \mid 1)} \,\Big|\, \pi_i \right\} = 0, \qquad i = 1, 2,$

then the procedure is unique except for sets of probability zero.
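
The rule (9) can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the helper names `normal_pdf` and `bayes_region`, and the choice of two univariate normal densities with means 0 and 2, are hypothetical and are not taken from the text.

```python
import math


def normal_pdf(x, mean, sd):
    """Univariate normal density, used here only as an example density."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_region(x, p1, p2, q1, q2, c21, c12):
    """Return 1 if x falls in R1 of rule (9), otherwise 2.

    The cross-multiplied form (9) is used instead of the ratio form (10)
    so that points with p2(x) = 0 need no special handling. Ties go to R1,
    the same arbitrary convention as in the text.
    """
    return 1 if c21 * q1 * p1(x) >= c12 * q2 * p2(x) else 2


# Assumed example: pi_1 is N(0, 1) and pi_2 is N(2, 1).
p1 = lambda x: normal_pdf(x, 0.0, 1.0)
p2 = lambda x: normal_pdf(x, 2.0, 1.0)

# Equal priors and costs: the boundary is the midpoint x = 1.
print(bayes_region(0.9, p1, p2, 0.5, 0.5, 1.0, 1.0))
print(bayes_region(1.1, p1, p2, 0.5, 0.5, 1.0, 1.0))

# Making C(1|2) ten times larger shrinks R1 toward smaller x.
print(bayes_region(0.0, p1, p2, 0.5, 0.5, 1.0, 10.0))
```

Only the products $C(2 \mid 1) q_1$ and $C(1 \mid 2) q_2$ enter the comparison, which is the sense in which only the ratio of the weighted costs matters.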


6.3.2. The Case When No Set of A Priori Probabilities Is Known 


In many instances of classification the statistician cannot assign a priori
probabilities to the two populations. In this case we shall look for the class of
admissible procedures, that is, the set of procedures that cannot be improved
upon.

First let us prove that a Bayes procedure is admissible. Let $R = (R_1, R_2)$
be a Bayes procedure for a given $q_1$, $q_2$; is there a procedure $R^* = (R_1^*, R_2^*)$
such that $P(1 \mid 2, R^*) \le P(1 \mid 2, R)$ and $P(2 \mid 1, R^*) \le P(2 \mid 1, R)$ with at least
one strict inequality? Since $R$ is a Bayes procedure,

(12)  $q_1 P(2 \mid 1, R) + q_2 P(1 \mid 2, R) \le q_1 P(2 \mid 1, R^*) + q_2 P(1 \mid 2, R^*).$

This inequality can be written

(13)  $q_1 [P(2 \mid 1, R) - P(2 \mid 1, R^*)] \le q_2 [P(1 \mid 2, R^*) - P(1 \mid 2, R)].$

Suppose $0 < q_1 < 1$. Then if $P(1 \mid 2, R^*) < P(1 \mid 2, R)$, the right-hand side of
(13) is less than zero and therefore $P(2 \mid 1, R) < P(2 \mid 1, R^*)$. Then $P(2 \mid 1, R^*)
< P(2 \mid 1, R)$ similarly implies $P(1 \mid 2, R) < P(1 \mid 2, R^*)$. Thus $R^*$ is not better
than $R$, and $R$ is admissible. If $q_1 = 0$, then (13) implies $0 \le P(1 \mid 2, R^*) -
P(1 \mid 2, R)$. For a Bayes procedure, $R_1$ includes only points for which $p_2(x) = 0$.
Therefore, $P(1 \mid 2, R) = 0$ and if $R^*$ is to be better $P(1 \mid 2, R^*) = 0$. If $\Pr\{p_2(x)
= 0 \mid \pi_1\} = 0$, then $P(2 \mid 1, R) = \Pr\{p_2(x) > 0 \mid \pi_1\} = 1$. If $P(1 \mid 2, R^*) = 0$, then
$R_1^*$ contains only points for which $p_2(x) = 0$. Then $P(2 \mid 1, R^*) = \Pr\{R_2^* \mid \pi_1\}
= \Pr\{p_2(x) > 0 \mid \pi_1\} = 1$, and $R^*$ is not better than $R$.


Theorem 6.3.2. If $\Pr\{p_2(x) = 0 \mid \pi_1\} = 0$ and $\Pr\{p_1(x) = 0 \mid \pi_2\} = 0$, then
every Bayes procedure is admissible.


Now let us prove the converse, namely, that every admissible procedure is
a Bayes procedure. We assume*

(14)  $\Pr\left\{ \frac{p_1(x)}{p_2(x)} = k \,\Big|\, \pi_i \right\} = 0, \qquad i = 1, 2, \quad 0 \le k \le \infty.$

Then for any $q_1$ the Bayes procedure is unique. Moreover, the cdf of
$p_1(x)/p_2(x)$ for $\pi_1$ and $\pi_2$ is continuous.
Let $R$ be an admissible procedure. Then there exists a $k$ such that

(15)  $P(2 \mid 1, R) = \Pr\left\{ \frac{p_1(x)}{p_2(x)} \le k \,\Big|\, \pi_1 \right\} = P(2 \mid 1, R^*),$

where $R^*$ is the Bayes procedure corresponding to $q_2/q_1 = k$ [i.e., $q_1 = 1/(1
+ k)$]. Since $R$ is admissible, $P(1 \mid 2, R) \le P(1 \mid 2, R^*)$. However, since by
Theorem 6.3.2 $R^*$ is admissible, $P(1 \mid 2, R) \ge P(1 \mid 2, R^*)$; that is, $P(1 \mid 2, R) =
P(1 \mid 2, R^*)$. Therefore, $R$ is also a Bayes procedure; by the uniqueness of
Bayes procedures $R$ is the same as $R^*$.


Theorem 6.3.3. If (14) holds, then every admissible procedure is a Bayes 
procedure. 


The proof of Theorem 6.3.3 shows that the class of Bayes procedures is
complete. For if $R$ is any procedure outside the class, we construct a Bayes
procedure $R^*$ so that $P(2 \mid 1, R) = P(2 \mid 1, R^*)$. Then, since $R^*$ is admissible,
$P(1 \mid 2, R) \ge P(1 \mid 2, R^*)$. Furthermore, the class of Bayes procedures is mini-
mal complete since it is identical with the class of admissible procedures.


*$p_1(x)/p_2(x) = \infty$ means $p_2(x) = 0$.

Theorem 6.3.4. If (14) holds, the class of Bayes procedures is minimal
complete.


Finally, let us consider the minimax procedure. Let $P(i \mid j, q_1) = P(i \mid j, R)$,
where $R$ is the Bayes procedure corresponding to $q_1$. $P(i \mid j, q_1)$ is a continu-
ous function of $q_1$. $P(2 \mid 1, q_1)$ varies from 1 to 0 as $q_1$ goes from 0 to 1;
$P(1 \mid 2, q_1)$ varies from 0 to 1. Thus there is a value of $q_1$, say $q_1^*$, such that
$P(2 \mid 1, q_1^*) = P(1 \mid 2, q_1^*)$. This is the minimax solution, for if there were
another procedure $R^*$ such that $\max\{P(2 \mid 1, R^*), P(1 \mid 2, R^*)\} \le P(2 \mid 1, q_1^*) =
P(1 \mid 2, q_1^*)$, that would contradict the fact that every Bayes solution is admissi-
ble.


6.4. CLASSIFICATION INTO ONE OF TWO KNOWN MULTIVARIATE 
NORMAL POPULATIONS 


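Now we shall use the general procedure outlined above in the case of two
multivariate normal populations with equal covariance matrices, namely,
$N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$, where $\mu^{(i)\prime} = (\mu_1^{(i)}, \ldots, \mu_p^{(i)})$ is the vector of means
of the $i$th population, $i = 1, 2$, and $\Sigma$ is the matrix of variances and covari-
ances of each population. [The approach was first used by Wald (1944).]
Then the $i$th density is

(1)  $p_i(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left[ -\tfrac{1}{2} (x - \mu^{(i)})' \Sigma^{-1} (x - \mu^{(i)}) \right].$

The ratio of densities is

(2)  $\frac{p_1(x)}{p_2(x)} = \frac{\exp[-\tfrac{1}{2}(x - \mu^{(1)})' \Sigma^{-1} (x - \mu^{(1)})]}{\exp[-\tfrac{1}{2}(x - \mu^{(2)})' \Sigma^{-1} (x - \mu^{(2)})]}$
     $= \exp\{ -\tfrac{1}{2} [ (x - \mu^{(1)})' \Sigma^{-1} (x - \mu^{(1)}) - (x - \mu^{(2)})' \Sigma^{-1} (x - \mu^{(2)}) ] \}.$

The region of classification into $\pi_1$, $R_1$, is the set of $x$'s for which (2) is
greater than or equal to $k$ (for $k$ suitably chosen). Since the logarithmic
function is monotonically increasing, the inequality can be written in terms of
the logarithm of (2) as

(3)  $-\tfrac{1}{2} [ (x - \mu^{(1)})' \Sigma^{-1} (x - \mu^{(1)}) - (x - \mu^{(2)})' \Sigma^{-1} (x - \mu^{(2)}) ] \ge \log k.$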
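The left-hand side of (3) can be expanded as

(4)  $-\tfrac{1}{2} [ x'\Sigma^{-1}x - x'\Sigma^{-1}\mu^{(1)} - \mu^{(1)\prime}\Sigma^{-1}x + \mu^{(1)\prime}\Sigma^{-1}\mu^{(1)}$
     $\quad - x'\Sigma^{-1}x + x'\Sigma^{-1}\mu^{(2)} + \mu^{(2)\prime}\Sigma^{-1}x - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)} ].$

By rearrangement of the terms we obtain

(5)  $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$

The first term is the well-known discriminant function. It is a function of the
components of the observation vector.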


The following theorem is now a direct consequence of Theorem 6.3.1. 


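Theorem 6.4.1. If $\pi_i$ has the density (1), $i = 1, 2$, the best regions of
classification are given by

(6)  $R_1$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \ge \log k$,
     $R_2$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) < \log k$.

If a priori probabilities $q_1$ and $q_2$ are known, then $k$ is given by

(7)  $k = \frac{q_2 C(1 \mid 2)}{q_1 C(2 \mid 1)}.$

In the particular case of the two populations being equally likely and the
costs being equal, $k = 1$ and $\log k = 0$. Then the region of classification into
$\pi_1$ is

(8)  $R_1$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$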
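
The regions (6) with the cutoff (7) can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the function names `discriminant_u` and `classify_normal` and the example parameters are hypothetical. The coefficient vector $\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ is obtained by solving a linear system rather than by forming the inverse.

```python
import numpy as np


def discriminant_u(x, mu1, mu2, sigma):
    """Left-hand side of (6): [x - (mu1 + mu2)/2]' Sigma^{-1} (mu1 - mu2)."""
    coef = np.linalg.solve(sigma, mu1 - mu2)  # Sigma^{-1}(mu1 - mu2)
    return float((x - 0.5 * (mu1 + mu2)) @ coef)


def classify_normal(x, mu1, mu2, sigma, q1=0.5, q2=0.5, c21=1.0, c12=1.0):
    """Return 1 if x lies in R1 of (6), with k taken from (7); else 2."""
    k = (q2 * c12) / (q1 * c21)
    return 1 if discriminant_u(x, mu1, mu2, sigma) >= np.log(k) else 2


# Assumed example parameters (not from the text).
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 0.0])
sigma = np.eye(2)

x = np.array([0.5, 3.0])
print(discriminant_u(x, mu1, mu2, sigma))            # 1.0 for these values
print(classify_normal(x, mu1, mu2, sigma))           # equal priors and costs
print(classify_normal(x, mu1, mu2, sigma, q1=0.1, q2=0.9))
```

In this example the second coordinate of $x$ plays no role because the two means differ only in the first coordinate and $\Sigma = I$; raising $q_2$ moves the cutoff $\log k$ above the observed value and changes the decision.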


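If we do not have a priori probabilities, we may select $\log k = c$, say, on the
basis of making the expected losses due to misclassification equal. Let $X$ be a
random observation. Then we wish to find the distribution of

(9)  $U = X'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$

on the assumption that $X$ is distributed according to $N(\mu^{(1)}, \Sigma)$ and then on
the assumption that $X$ is distributed according to $N(\mu^{(2)}, \Sigma)$. When $X$ is
distributed according to $N(\mu^{(1)}, \Sigma)$, $U$ is normally distributed with mean

(10)  $\mathscr{E}_1 U = \mu^{(1)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$
      $= \tfrac{1}{2}(\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$

and variance

(11)  $\operatorname{Var}_1(U) = \mathscr{E}_1 (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (X - \mu^{(1)})(X - \mu^{(1)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$
      $= (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$

The Mahalanobis squared distance between $N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$ is

(12)  $(\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) = \Delta^2,$

say. Then $U$ is distributed according to $N(\tfrac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed
according to $N(\mu^{(1)}, \Sigma)$. If $X$ is distributed according to $N(\mu^{(2)}, \Sigma)$, then

(13)  $\mathscr{E}_2 U = \mu^{(2)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \tfrac{1}{2}(\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$
      $= \tfrac{1}{2}(\mu^{(2)} - \mu^{(1)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$
      $= -\tfrac{1}{2}\Delta^2.$

The variance is the same as when $X$ is distributed according to $N(\mu^{(1)}, \Sigma)$
because it depends only on the second-order moments of $X$. Thus $U$ is
distributed according to $N(-\tfrac{1}{2}\Delta^2, \Delta^2)$.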

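The probability of misclassification if the observation is from $\pi_1$ is

(14)  $P(2 \mid 1) = \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}\,\Delta} e^{-\frac{1}{2}(z - \frac{1}{2}\Delta^2)^2/\Delta^2}\,dz = \int_{-\infty}^{(c - \frac{1}{2}\Delta^2)/\Delta} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,dy,$

and the probability of misclassification if the observation is from $\pi_2$ is

(15)  $P(1 \mid 2) = \int_{c}^{\infty} \frac{1}{\sqrt{2\pi}\,\Delta} e^{-\frac{1}{2}(z + \frac{1}{2}\Delta^2)^2/\Delta^2}\,dz = \int_{(c + \frac{1}{2}\Delta^2)/\Delta}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,dy.$

Figure 6.1 indicates the two probabilities as the shaded portions in the tails.

[Figure 6.1; marks on the horizontal axis: $-\tfrac{1}{2}\Delta^2$, 0, $c$, $\tfrac{1}{2}\Delta^2$.]

Figure 6.1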
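
The two tail probabilities (14) and (15) are easy to evaluate with the standard normal cdf. The sketch below uses only the Python standard library (`math.erf`); the function names and the example values $\Delta = 2$, $c = 0$ are assumptions for illustration.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probs(delta, c):
    """Return (P(2|1), P(1|2)) from (14) and (15) for cutoff c = log k.

    delta is the Mahalanobis distance (the positive square root of
    Delta^2) and is assumed to be strictly positive.
    """
    p21 = std_normal_cdf((c - 0.5 * delta**2) / delta)
    p12 = 1.0 - std_normal_cdf((c + 0.5 * delta**2) / delta)
    return p21, p12


# Assumed example: Delta = 2 and c = 0, so both limits are one standard
# deviation from the respective means and the two probabilities agree.
p21_example, p12_example = misclassification_probs(2.0, 0.0)
print(p21_example, p12_example)
```

Increasing $c$ enlarges $P(2 \mid 1)$ and reduces $P(1 \mid 2)$, which is the trade-off that the choice of cutoff controls.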
For the minimax solution we choose $c$ so that

(16)  $C(1 \mid 2) \int_{(c + \frac{1}{2}\Delta^2)/\Delta}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,dy = C(2 \mid 1) \int_{-\infty}^{(c - \frac{1}{2}\Delta^2)/\Delta} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,dy.$


Theorem 6.4.2. If the $\pi_i$ have densities (1), $i = 1, 2$, the minimax regions of
classification are given by (6) where $c = \log k$ is chosen by the condition (16) with
$C(i \mid j)$ the two costs of misclassification.


It should be noted that if the costs of misclassification are equal, $c = 0$ and
the probability of misclassification is

(17)  $\int_{\Delta/2}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,dy.$

In case the costs of misclassification are unequal, $c$ could be determined to
sufficient accuracy by a trial-and-error method with the normal tables.
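
The trial-and-error search for $c$ in (16) can be replaced by bisection, since the left-hand side of (16) decreases in $c$ and the right-hand side increases. The following Python sketch is one possible way to do this, not the method of the text; the function names, the bracketing interval $[-50, 50]$, and the example costs are assumptions, and the interval must be widened if $\Delta$ is very large.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def loss_gap(c, delta, c21, c12):
    """C(1|2) P(1|2) - C(2|1) P(2|1) for cutoff c; zero when (16) holds."""
    p21 = std_normal_cdf((c - 0.5 * delta**2) / delta)
    p12 = 1.0 - std_normal_cdf((c + 0.5 * delta**2) / delta)
    return c12 * p12 - c21 * p21


def minimax_cutoff(delta, c21, c12, lo=-50.0, hi=50.0, iterations=200):
    """Bisection for the c solving (16); loss_gap is decreasing in c."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if loss_gap(mid, delta, c21, c12) > 0.0:
            lo = mid   # expected loss from pi_2 still larger: raise c
        else:
            hi = mid
    return 0.5 * (lo + hi)


c_equal = minimax_cutoff(2.0, c21=1.0, c12=1.0)    # equal costs
c_unequal = minimax_cutoff(2.0, c21=1.0, c12=4.0)  # C(1|2) four times larger
print(c_equal, c_unequal)
```

With equal costs the search returns a value at (or numerically indistinguishable from) $c = 0$, in agreement with the remark preceding (17); with the larger $C(1 \mid 2)$ the cutoff moves upward so that fewer observations are assigned to $\pi_1$.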
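Both terms in (5) involve the vector

(18)  $\delta = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$

This is obtained as the solution of

(19)  $\Sigma\delta = (\mu^{(1)} - \mu^{(2)})$

by an efficient computing method. The discriminant function $x'\delta$ is the linear
function that maximizes

(20)  $\frac{[\mathscr{E}_1(X'd) - \mathscr{E}_2(X'd)]^2}{\operatorname{Var}(X'd)}$

for all choices of $d$. The numerator of (20) is

(21)  $[\mu^{(1)\prime}d - \mu^{(2)\prime}d]^2 = d'[(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})']d;$

the denominator is

(22)  $d'\mathscr{E}(X - \mathscr{E}X)(X - \mathscr{E}X)'d = d'\Sigma d.$

We wish to maximize (21) with respect to $d$, holding (22) constant. If $\lambda$ is a
Lagrange multiplier, we ask for the maximum of

(23)  $d'[(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})']d - \lambda(d'\Sigma d - 1).$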
The derivatives of (23) with respect to the components of $d$ are set equal to
zero to obtain

(24)  $2[(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})']d = 2\lambda\Sigma d.$

Since $(\mu^{(1)} - \mu^{(2)})'d$ is a scalar, say $\nu$, we can write (24) as

(25)  $\mu^{(1)} - \mu^{(2)} = \frac{\lambda}{\nu}\Sigma d.$

Thus the solution is proportional to $\delta$.

We may finally note that if we have a sample of $N$ from either $\pi_1$ or $\pi_2$,
we use the mean of the sample and classify it as from $N[\mu^{(1)}, (1/N)\Sigma]$ or
$N[\mu^{(2)}, (1/N)\Sigma]$.
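
The maximizing property of $\delta$ can be checked numerically. The sketch below, assuming NumPy, compares the ratio (20) at $d = \delta$ with its value at randomly drawn directions; the parameter values, the seed, and the function name `separation_ratio` are hypothetical. A random search of this kind illustrates the result but does not prove it.

```python
import numpy as np


def separation_ratio(d, mu1, mu2, sigma):
    """Ratio (20): squared mean difference of X'd over the variance of X'd."""
    return float(((mu1 - mu2) @ d) ** 2 / (d @ sigma @ d))


# Assumed example parameters; sigma is symmetric positive definite.
mu1 = np.array([1.0, 2.0, 0.0])
mu2 = np.array([0.0, 1.0, 1.0])
sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# delta solves (19): sigma @ delta = mu1 - mu2.
delta = np.linalg.solve(sigma, mu1 - mu2)
best = separation_ratio(delta, mu1, mu2, sigma)
mahalanobis_sq = float((mu1 - mu2) @ delta)  # Delta^2 of (12) in Section 6.4

rng = np.random.default_rng(0)
others = [separation_ratio(rng.standard_normal(3), mu1, mu2, sigma)
          for _ in range(1000)]

print(best, mahalanobis_sq, max(others))
```

At $d = \delta$ the ratio equals the Mahalanobis squared distance $\Delta^2$, and no randomly drawn direction exceeds it.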


6.5. CLASSIFICATION INTO ONE OF TWO MULTIVARIATE NORMAL 
POPULATIONS WHEN THE PARAMETERS ARE ESTIMATED 


6.5.1. The Criterion of Classification 


Thus far we have assumed that the two populations are known exactly. In 
most applications of this theory the populations are not known, but must be 
inferred from samples, one from each population. We shall now treat the 
case in which we have a sample from each of two normal populations and we 
wish to use that information in classifying another observation as coming 
from one of the two populations. 


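Suppose that we have a sample $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ from $N(\mu^{(1)}, \Sigma)$ and a sample
$x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ from $N(\mu^{(2)}, \Sigma)$. In one terminology these are "training samples."
On the basis of this information we wish to classify the observation $x$ as
coming from $\pi_1$ or $\pi_2$. Clearly, our best estimate of $\mu^{(1)}$ is $\bar{x}^{(1)} = \sum_{\alpha=1}^{N_1} x_\alpha^{(1)}/N_1$,
of $\mu^{(2)}$ is $\bar{x}^{(2)} = \sum_{\alpha=1}^{N_2} x_\alpha^{(2)}/N_2$, and of $\Sigma$ is $S$ defined by

(1)  $(N_1 + N_2 - 2)S = \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})'$
     $\quad + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})'.$

We substitute these estimates for the parameters in (5) of Section 6.4 to
obtain

(2)  $W(x) = x'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}).$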
The first term of (2) is the discriminant function based on two samples
[suggested by Fisher (1936)]. It is the linear function that has greatest
variance between samples relative to the variance within samples (Problem
6.12). We propose that (2) be used as the criterion of classification in the
same way that (5) of Section 6.4 is used.
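
A minimal NumPy sketch of the sample criterion (2) follows. The function names, the simulated training samples, and the seed are assumptions made for illustration; each sample is stored with one observation per row.

```python
import numpy as np


def pooled_covariance(sample1, sample2):
    """Pooled estimate S of equation (1); rows are observations."""
    dev1 = sample1 - sample1.mean(axis=0)
    dev2 = sample2 - sample2.mean(axis=0)
    n1, n2 = len(sample1), len(sample2)
    return (dev1.T @ dev1 + dev2.T @ dev2) / (n1 + n2 - 2)


def w_statistic(x, sample1, sample2):
    """Criterion W(x) of equation (2)."""
    mean1, mean2 = sample1.mean(axis=0), sample2.mean(axis=0)
    s = pooled_covariance(sample1, sample2)
    coef = np.linalg.solve(s, mean1 - mean2)  # S^{-1}(xbar1 - xbar2)
    return float((x - 0.5 * (mean1 + mean2)) @ coef)


# Simulated training samples (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
sample1 = rng.normal(size=(30, 2)) + np.array([2.0, 0.0])
sample2 = rng.normal(size=(40, 2))

w_at_mean1 = w_statistic(sample1.mean(axis=0), sample1, sample2)
w_at_mean2 = w_statistic(sample2.mean(axis=0), sample1, sample2)
print(w_at_mean1, w_at_mean2)
```

Evaluated at the two sample means, $W$ takes the values $\pm\tfrac{1}{2}(\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$, the sample analogs of the means $\pm\tfrac{1}{2}\Delta^2$ of $U$ in Section 6.4.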

When the populations are known, we can argue that the classification
criterion is the best in the sense that its use minimizes the expected loss in 
the case of known a priori probabilities and generates the class of admissible 
procedures when a priori probabilities are not known. We cannot justify the 
use of (2) in the same way. However, it seems intuitively reasonable that (2) 
should give good results. Another criterion is indicated in Section 6.5.5. 

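Suppose we have a sample $x_1, \ldots, x_N$ from either $\pi_1$ or $\pi_2$, and we wish
to classify the sample as a whole. Then we define $S$ by

(3)  $(N_1 + N_2 + N - 3)S = \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})'$
     $\quad + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})' + \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})',$

where

(4)  $\bar{x} = \frac{1}{N} \sum_{\alpha=1}^{N} x_\alpha.$

Then the criterion is

(5)  $[\bar{x} - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})]' S^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}).$

The larger $N$ is, the smaller are the probabilities of misclassification.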


6.5.2. On the Distribution of the Criterion 
Let

(6)  $W = X'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)})$
     $= [X - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})]' S^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)})$

for random $X$, $\bar{X}^{(1)}$, $\bar{X}^{(2)}$, and $S$.
The distribution of $W$ is extremely complicated. It depends on the sample
sizes and the unknown $\Delta^2$. Let

(7)  $Y_1 = c_1 [X - (N_1 + N_2)^{-1}(N_1 \bar{X}^{(1)} + N_2 \bar{X}^{(2)})],$
(8)  $Y_2 = c_2 (\bar{X}^{(1)} - \bar{X}^{(2)}),$

where $c_1 = \sqrt{(N_1 + N_2)/(N_1 + N_2 + 1)}$ and $c_2 = \sqrt{N_1 N_2/(N_1 + N_2)}$. Then
$Y_1$ and $Y_2$ are independently normally distributed with covariance matrix $\Sigma$.
The expected value of $Y_2$ is $c_2(\mu^{(1)} - \mu^{(2)})$, and the expected value of $Y_1$ is
$c_1[N_2/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})$ if $X$ is from $\pi_1$ and $-c_1[N_1/(N_1 + N_2)](\mu^{(1)} -
\mu^{(2)})$ if $X$ is from $\pi_2$. Let $Y = (Y_1\ Y_2)$ and

(9)  $M = Y'S^{-1}Y = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}.$

Then

(10)  $W = \sqrt{\frac{N_1 + N_2 + 1}{N_1 N_2}}\, m_{12} + \frac{N_1 - N_2}{2 N_1 N_2}\, m_{22}.$

The density of $M$ has been given by Sitgreaves (1952). Anderson (1951a) and
Wald (1944) have also studied the distribution of $W$.

If $N_1 = N_2$, the distribution of $W$ for $X$ from $\pi_1$ is the same as that of
$-W$ for $X$ from $\pi_2$. Thus, if $W \ge 0$ is the region of classification as $\pi_1$, then
the probability of misclassifying $X$ when it is from $\pi_1$ is equal to the
probability of misclassifying it when it is from $\pi_2$.


6.5.3. The Asymptotic Distribution of the Criterion 


In the case of large samples from $N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$, we can apply
limiting distribution theory. Since $\bar{X}^{(1)}$ is the mean of a sample of $N_1$
independent observations from $N(\mu^{(1)}, \Sigma)$, we know that

(11)  $\operatorname{plim}_{N_1 \to \infty} \bar{X}^{(1)} = \mu^{(1)}.$

The explicit definition of (11) is as follows: Given arbitrary positive $\delta$ and $\varepsilon$,
we can find $N$ large enough so that for $N_1 > N$

(12)  $\Pr\{ |\bar{X}_i^{(1)} - \mu_i^{(1)}| < \delta,\ i = 1, \ldots, p \} > 1 - \varepsilon.$

(See Problem 3.23.) This can be proved by using the Tchebycheff inequality.
Similarly,

(13)  $\operatorname{plim}_{N_2 \to \infty} \bar{X}^{(2)} = \mu^{(2)},$

and

(14)  $\operatorname{plim} S = \Sigma$

as $N_1 \to \infty$, $N_2 \to \infty$, or as both $N_1, N_2 \to \infty$. From (14) we obtain

(15)  $\operatorname{plim} S^{-1} = \Sigma^{-1},$

since the probability limits of sums, differences, products, and quotients of
random variables are the sums, differences, products, and quotients of their
probability limits as long as the probability limit of each denominator is
different from zero [Cramér (1946), p. 254]. Furthermore,

(16)  $\operatorname{plim}_{N_1, N_2 \to \infty} S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}),$

(17)  $\operatorname{plim}_{N_1, N_2 \to \infty} (\bar{X}^{(1)} + \bar{X}^{(2)})' S^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}) = (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$

It follows then that the limiting distribution of $W$ is the distribution of $U$.
For sufficiently large samples from $\pi_1$ and $\pi_2$ we can use the criterion as if
we knew the population exactly and make only a small error. [The result was
first given by Wald (1944).]


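Theorem 6.5.1. Let $W$ be given by (6) with $\bar{X}^{(1)}$ the mean of a sample of $N_1$
from $N(\mu^{(1)}, \Sigma)$, $\bar{X}^{(2)}$ the mean of a sample of $N_2$ from $N(\mu^{(2)}, \Sigma)$, and $S$ the
estimate of $\Sigma$ based on the pooled sample. The limiting distribution of $W$ as
$N_1 \to \infty$ and $N_2 \to \infty$ is $N(\tfrac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed according to $N(\mu^{(1)}, \Sigma)$
and is $N(-\tfrac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed according to $N(\mu^{(2)}, \Sigma)$.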


6.5.4. Another Derivation of the Criterion 


A convenient mnemonic derivation of the criterion is the use of regression of 
a dummy variate [given by Fisher (1936)]. Let 


М 
(18) YO = NEN а=1,..., М, уб = ще, а= 1,..., №. 
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Then formally find the regression on the variates x® by choosing b to 
minimize 


2 N 
(19) у p? -b (x - ғ), 
i=1 а—1 
where 
00 NO 4 NO 
(0) = М+М, 
The normal equations are 
2 м ‚м 
со EE Gb-569-5)5- X EPOP- 
i=l aci ist а=1 
N,N. 8 _ _ - 
= g rK; [E -3) - G9 -2 
N,N, ,_ _ 
= N FN, F” -39). 


The matrix multiplying b can be written as 


2 N 
a E XEGQ9-:(9-s) 
1=1 а=1 i 


М, 


= y (x — 2) (xO #0), 


i=l а=1 


+N,( – #)(20 —x)' + № (50 -x)(x? -x) 


2 Ni 
=E У (xP — 2) (x — 20)" 
i=] а=1 
+ ie (ED — 30)( FO — ZO)", 
1 2 


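Thus (21) can be written as

$$A b = (\bar{x}^{(1)} - \bar{x}^{(2)}) \left[ \frac{N_1N_2}{N_1+N_2} - \frac{N_1N_2}{N_1+N_2} (\bar{x}^{(1)} - \bar{x}^{(2)})' b \right], \tag{23}$$

where

$$A = \sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'. \tag{24}$$

Since $(\bar{x}^{(1)} - \bar{x}^{(2)})'b$ is a scalar, we see that the solution $b$ of (23) is
proportional to $S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.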
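
The proportionality can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using simulated data; the function names and the sample sizes are illustrative choices, not part of the text. It regresses the dummy variate (18) on the deviations from the pooled mean (20) by least squares, which minimizes (19), and compares the result with $S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.

```python
import numpy as np

def dummy_regression_direction(x1, x2):
    """Least-squares b for the dummy variate (18) regressed on x - xbar, as in (19)."""
    n1, n2 = len(x1), len(x2)
    x = np.vstack([x1, x2])
    # Dummy variate: N2/(N1+N2) for the first sample, -N1/(N1+N2) for the second.
    y = np.concatenate([np.full(n1, n2 / (n1 + n2)),
                        np.full(n2, -n1 / (n1 + n2))])
    xc = x - x.mean(axis=0)                 # deviations from the pooled mean (20)
    b, *_ = np.linalg.lstsq(xc, y, rcond=None)
    return b

def discriminant_direction(x1, x2):
    """S^{-1}(xbar1 - xbar2) with S the pooled covariance estimate."""
    d1 = x1 - x1.mean(axis=0)
    d2 = x2 - x2.mean(axis=0)
    s = (d1.T @ d1 + d2.T @ d2) / (len(x1) + len(x2) - 2)
    return np.linalg.solve(s, x1.mean(axis=0) - x2.mean(axis=0))

# Simulated samples (illustrative values only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=(30, 3)) + np.array([1.0, 0.0, -0.5])
x2 = rng.normal(size=(40, 3))

b = dummy_regression_direction(x1, x2)
d = discriminant_direction(x1, x2)
scale = (b @ d) / (d @ d)                   # factor of proportionality
print(np.allclose(b, scale * d), scale > 0)
```

The factor `scale` depends on the sample sizes and on the sample distance between the means, so only the direction of $b$ carries information for classification.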


6.5.5. The Likelihood Ratio Criterion 


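Another criterion which can be used in classification is the likelihood ratio
criterion. Consider testing the composite null hypothesis that $x, x_1^{(1)}, \dots, x_{N_1}^{(1)}$
are drawn from $N(\mu^{(1)}, \Sigma)$ and $x_1^{(2)}, \dots, x_{N_2}^{(2)}$ are drawn from $N(\mu^{(2)}, \Sigma)$
against the composite alternative hypothesis that $x_1^{(1)}, \dots, x_{N_1}^{(1)}$ are drawn from
$N(\mu^{(1)}, \Sigma)$ and $x, x_1^{(2)}, \dots, x_{N_2}^{(2)}$ are drawn from $N(\mu^{(2)}, \Sigma)$, with $\mu^{(1)}$, $\mu^{(2)}$, and
$\Sigma$ unspecified. Under the first hypothesis the maximum likelihood estimators
of the parameters are

$$\begin{aligned}
\hat\mu_1^{(1)} &= \frac{N_1\bar{x}^{(1)} + x}{N_1+1}, \\
\hat\mu_1^{(2)} &= \bar{x}^{(2)}, \\
\hat\Sigma_1 &= \frac{1}{N_1+N_2+1} \Bigg[ \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \hat\mu_1^{(1)})(x_\alpha^{(1)} - \hat\mu_1^{(1)})' + (x - \hat\mu_1^{(1)})(x - \hat\mu_1^{(1)})' \\
&\qquad\qquad + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \hat\mu_1^{(2)})(x_\alpha^{(2)} - \hat\mu_1^{(2)})' \Bigg].
\end{aligned} \tag{25}$$

Since

$$\begin{aligned}
&\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \hat\mu_1^{(1)})(x_\alpha^{(1)} - \hat\mu_1^{(1)})' + (x - \hat\mu_1^{(1)})(x - \hat\mu_1^{(1)})' \\
&\quad = \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + N_1(\bar{x}^{(1)} - \hat\mu_1^{(1)})(\bar{x}^{(1)} - \hat\mu_1^{(1)})' + (x - \hat\mu_1^{(1)})(x - \hat\mu_1^{(1)})' \\
&\quad = \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \frac{N_1}{N_1+1} (x - \bar{x}^{(1)})(x - \bar{x}^{(1)})',
\end{aligned} \tag{26}$$

we can write $\hat\Sigma_1$ as

$$\hat\Sigma_1 = \frac{1}{N_1+N_2+1} \left[ A + \frac{N_1}{N_1+1} (x - \bar{x}^{(1)})(x - \bar{x}^{(1)})' \right], \tag{27}$$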
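where $A$ is given by (24). Under the assumptions of the alternative hypothesis
we find (by considerations of symmetry) that the maximum likelihood estimators
of the parameters are

$$\begin{aligned}
\hat\mu_2^{(1)} &= \bar{x}^{(1)}, \\
\hat\mu_2^{(2)} &= \frac{N_2\bar{x}^{(2)} + x}{N_2+1}, \\
\hat\Sigma_2 &= \frac{1}{N_1+N_2+1} \left[ A + \frac{N_2}{N_2+1} (x - \bar{x}^{(2)})(x - \bar{x}^{(2)})' \right].
\end{aligned} \tag{28}$$
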

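The likelihood ratio criterion is, therefore, the $(N_1 + N_2 + 1)/2$th power of

$$\frac{|\hat\Sigma_2|}{|\hat\Sigma_1|} = \frac{\left| A + \dfrac{N_2}{N_2+1} (x - \bar{x}^{(2)})(x - \bar{x}^{(2)})' \right|}{\left| A + \dfrac{N_1}{N_1+1} (x - \bar{x}^{(1)})(x - \bar{x}^{(1)})' \right|}. \tag{29}$$

This ratio can also be written (Corollary A.3.1)

$$\frac{1 + \dfrac{N_2}{N_2+1} (x - \bar{x}^{(2)})' A^{-1} (x - \bar{x}^{(2)})}{1 + \dfrac{N_1}{N_1+1} (x - \bar{x}^{(1)})' A^{-1} (x - \bar{x}^{(1)})}
= \frac{n + \dfrac{N_2}{N_2+1} (x - \bar{x}^{(2)})' S^{-1} (x - \bar{x}^{(2)})}{n + \dfrac{N_1}{N_1+1} (x - \bar{x}^{(1)})' S^{-1} (x - \bar{x}^{(1)})}, \tag{30}$$

where $n = N_1 + N_2 - 2$. The region of classification into $\pi_1$ consists of those
points for which the ratio (30) is greater than or equal to a given number $K_n$.
It can be written

$$R_1: \quad n + \frac{N_2}{N_2+1} (x - \bar{x}^{(2)})' S^{-1} (x - \bar{x}^{(2)}) \ge K_n \left[ n + \frac{N_1}{N_1+1} (x - \bar{x}^{(1)})' S^{-1} (x - \bar{x}^{(1)}) \right]. \tag{31}$$

If $K_n = 1 + 2c/n$ and $N_1$ and $N_2$ are large, the region (31) is approximately
$W(x) \ge c$.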

If we take $K_n = 1$, the rule is to classify as $\pi_1$ if (30) is greater than 1 and
as $\pi_2$ if (30) is less than 1. This is the maximum likelihood rule. Let

$$Z = \frac12 \left[ \frac{N_2}{N_2+1} (x - \bar{x}^{(2)})' S^{-1} (x - \bar{x}^{(2)}) - \frac{N_1}{N_1+1} (x - \bar{x}^{(1)})' S^{-1} (x - \bar{x}^{(1)}) \right]. \tag{32}$$

Then the maximum likelihood rule is to classify as $\pi_1$ if $Z > 0$ and $\pi_2$ if
$Z < 0$. Roughly speaking, assign $x$ to $\pi_1$ or $\pi_2$ according to whether the
distance to $\bar{x}^{(1)}$ is less or greater than the distance to $\bar{x}^{(2)}$. The difference
between $W$ and $Z$ is

$$W - Z = \frac12 \left[ \frac{1}{N_2+1} (x - \bar{x}^{(2)})' S^{-1} (x - \bar{x}^{(2)}) - \frac{1}{N_1+1} (x - \bar{x}^{(1)})' S^{-1} (x - \bar{x}^{(1)}) \right], \tag{33}$$

which has the probability limit 0 as $N_1, N_2 \to \infty$. The probabilities of
misclassification with $W$ are equivalent asymptotically to those with $Z$ for large
samples.

Note that for $N_1 = N_2$, $Z = [N_1/(N_1 + 1)]W$. Then the symmetric test
based on the cutoff $c = 0$ is the same for $Z$ and $W$.
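
As an illustration, the two statistics can be computed side by side. This is a minimal sketch assuming NumPy and simulated data; the helper names `pooled_cov`, `w_stat`, and `z_stat` are hypothetical. It uses $W = [x - \tfrac12(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$ and $Z$ as defined in (32), and checks the relation $Z = [N_1/(N_1+1)]W$ for equal sample sizes.

```python
import numpy as np

def pooled_cov(x1, x2):
    """Pooled covariance estimate S with n = N1 + N2 - 2 degrees of freedom."""
    d1 = x1 - x1.mean(axis=0)
    d2 = x2 - x2.mean(axis=0)
    return (d1.T @ d1 + d2.T @ d2) / (len(x1) + len(x2) - 2)

def w_stat(x, x1, x2):
    """Classification statistic W."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    return (x - 0.5 * (m1 + m2)) @ np.linalg.solve(pooled_cov(x1, x2), m1 - m2)

def z_stat(x, x1, x2):
    """Maximum likelihood statistic Z of (32)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    s = pooled_cov(x1, x2)
    q1 = (x - m1) @ np.linalg.solve(s, x - m1)   # squared distance to xbar1
    q2 = (x - m2) @ np.linalg.solve(s, x - m2)   # squared distance to xbar2
    return 0.5 * (n2 / (n2 + 1) * q2 - n1 / (n1 + 1) * q1)

# Equal sample sizes, simulated for illustration.
rng = np.random.default_rng(2)
n_each = 20
x1 = rng.normal(size=(n_each, 2)) + np.array([1.5, 0.0])
x2 = rng.normal(size=(n_each, 2))
x = np.array([0.4, -0.3])

w = w_stat(x, x1, x2)
z = z_stat(x, x1, x2)
print(np.isclose(z, n_each / (n_each + 1) * w))
```

With unequal sample sizes the two statistics differ by the quantity in (33), which shrinks as both samples grow.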


6.5.6. Invariance 


The classification problem is invariant with respect to transformations

$$\begin{aligned}
x_\alpha^{(1)*} &= Bx_\alpha^{(1)} + c, \quad \alpha = 1,\dots,N_1, \\
x_\alpha^{(2)*} &= Bx_\alpha^{(2)} + c, \quad \alpha = 1,\dots,N_2, \\
x^* &= Bx + c,
\end{aligned} \tag{34}$$

where $B$ is nonsingular and $c$ is a vector. This transformation induces the
following transformation on the sufficient statistics:

$$\bar{x}^{(1)*} = B\bar{x}^{(1)} + c, \quad \bar{x}^{(2)*} = B\bar{x}^{(2)} + c, \quad x^* = Bx + c, \quad S^* = BSB', \tag{35}$$

with the same transformations on the parameters, $\mu^{(1)}$, $\mu^{(2)}$, and $\Sigma$. (Note
that $\mathcal{E}x = \mu^{(1)}$ or $\mu^{(2)}$.) Any invariant of the parameters is a function of
$\Delta^2 = (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. There exists a matrix $B$ and a vector $c$
such that

$$\mu^{(1)*} = B\mu^{(1)} + c = 0, \quad \mu^{(2)*} = B\mu^{(2)} + c = (\Delta, 0, \dots, 0)', \quad \Sigma^* = B\Sigma B' = I. \tag{36}$$

Therefore, $\Delta^2$ is the minimal invariant of the parameters. The elements of $M$
defined by (9) are invariant and are the minimal invariants of the sufficient
statistics. Thus invariant procedures depend on $M$, and the distribution of $M$
depends only on $\Delta^2$. The statistics $W$ and $Z$ are invariant.
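
The invariance of $W$ under (34) can be verified numerically. The sketch below assumes NumPy and simulated data; the matrix `B` and vector `c` are arbitrary illustrative choices (with `B` assumed nonsingular), and `w_stat` is a hypothetical helper name.

```python
import numpy as np

def w_stat(x, x1, x2):
    """Classification statistic W from two training samples (rows are observations)."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    d1, d2 = x1 - m1, x2 - m2
    s = (d1.T @ d1 + d2.T @ d2) / (len(x1) + len(x2) - 2)   # pooled covariance S
    return (x - 0.5 * (m1 + m2)) @ np.linalg.solve(s, m1 - m2)

rng = np.random.default_rng(1)
x1 = rng.normal(size=(25, 3)) + 1.0
x2 = rng.normal(size=(35, 3))
x = rng.normal(size=3)

B = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # arbitrary matrix, assumed nonsingular
c = np.array([5.0, -2.0, 0.5])

w_original = w_stat(x, x1, x2)
# Apply x* = Bx + c to the new observation and to every training observation.
w_transformed = w_stat(B @ x + c, x1 @ B.T + c, x2 @ B.T + c)
print(np.isclose(w_original, w_transformed))
```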


6.6. PROBABILITIES OF MISCLASSIFICATION


6.6.1. Asymptotic Expansions of the Probabilities of Misclassification 
Using W 


We may want to know the probabilities of misclassification before we draw
the two samples for determining the classification rule, and we may want to
know the (conditional) probabilities of misclassification after drawing the
samples. As observed earlier, the exact distributions of $W$ and $Z$ are very
difficult to calculate. Therefore, we treat asymptotic expansions of their
probabilities as $N_1$ and $N_2$ increase. The background is that the limiting
distribution of $W$ and $Z$ is $N(\tfrac12\Delta^2, \Delta^2)$ if $x$ is from $\pi_1$ and is $N(-\tfrac12\Delta^2, \Delta^2)$ if
$x$ is from $\pi_2$.

Okamoto (1963) obtained the asymptotic expansion of the distribution of
$W$ to terms of order $n^{-2}$, and Siotani and Wang (1975, 1977) to terms of
order $n^{-3}$. [Bowker and Sitgreaves (1961) treated the case of $N_1 = N_2$.] Let
$\Phi(\cdot)$ and $\phi(\cdot)$ be the cdf and density of $N(0, 1)$, respectively.


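Theorem 6.6.1. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit
($n = N_1 + N_2 - 2$),

$$\begin{aligned}
\Pr\left\{ \frac{W - \tfrac12\Delta^2}{\Delta} \le u \,\Big|\, \pi_1 \right\}
= \Phi(u) - \phi(u) \Big\{ & \frac{1}{2N_1\Delta^2} \left[ u^3 + (p-3)u - p\Delta \right] \\
& + \frac{1}{2N_2\Delta^2} \left[ u^3 + 2\Delta u^2 + (p - 3 + \Delta^2)u + (p-2)\Delta \right] \\
& + \frac{1}{4n} \left[ 4u^3 + 4\Delta u^2 + (6p - 6 + \Delta^2)u + 2(p-1)\Delta \right] \Big\} + O(n^{-2}),
\end{aligned} \tag{1}$$

and $\Pr\{ -(W + \tfrac12\Delta^2)/\Delta \le u \mid \pi_2 \}$ is (1) with $N_1$ and $N_2$ interchanged.

The rule using $W$ is to assign the observation $x$ to $\pi_1$ if $W(x) > c$ and to
$\pi_2$ if $W(x) \le c$. The probabilities of misclassification are given by Theorem
6.6.1 with $u = (c - \tfrac12\Delta^2)/\Delta$ and $u = -(c + \tfrac12\Delta^2)/\Delta$, respectively. For $c = 0$,
$u = -\tfrac12\Delta$. If $N_1 = N_2$, this defines an exact minimax procedure [Das Gupta
(1965)].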


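Corollary 6.6.1

$$\begin{aligned}
\Pr\left\{ W \le 0 \,\Big|\, \pi_1, \lim_{n\to\infty} \frac{N_1}{N_2} = 1 \right\}
&= \Phi(-\tfrac12\Delta) + \frac{1}{n}\, \phi(\tfrac12\Delta) \left[ \frac{p-1}{\Delta} + \frac{p\Delta}{4} \right] + o(n^{-1}) \\
&= \Pr\left\{ W \ge 0 \,\Big|\, \pi_2, \lim_{n\to\infty} \frac{N_1}{N_2} = 1 \right\}.
\end{aligned} \tag{2}$$

Note that the correction term is positive, as far as this correction goes;
that is, the probability of misclassification is greater than the value of the
normal approximation. The correction term (to order $n^{-1}$) increases with $p$
for given $\Delta$ and decreases with $\Delta$ for given $p$.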
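
To see the size of the correction in (2), the two parts can be evaluated directly. The sketch below uses only the Python standard library; the function names and the numerical inputs ($\Delta = 2$, $n = 50$) are illustrative assumptions, not values from the text.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def Phi(u):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def misclassification_w(delta, p, n):
    """Return the normal approximation and the order 1/n correction of (2)."""
    normal_part = Phi(-0.5 * delta)
    correction = phi(0.5 * delta) * ((p - 1) / delta + 0.25 * p * delta) / n
    return normal_part, correction

# Illustrative values: Delta = 2, n = 50, and two dimensions p.
normal_part, corr_p4 = misclassification_w(2.0, 4, 50)
_, corr_p8 = misclassification_w(2.0, 8, 50)
print(normal_part, corr_p4, corr_p8)
```

For fixed $\Delta$ and $n$, the correction grows with $p$, in line with the remark above.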


Since $\Delta$ is usually unknown, it is relevant to Studentize $W$. The sample
Mahalanobis squared distance

$$D^2 = (\bar{x}^{(1)} - \bar{x}^{(2)})' S^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}) \tag{3}$$

is an estimator of the population Mahalanobis squared distance $\Delta^2$. The
expectation of $D^2$ is

$$\mathcal{E} D^2 = \frac{n}{n-p-1} \left[ \Delta^2 + p \left( \frac{1}{N_1} + \frac{1}{N_2} \right) \right]. \tag{4}$$

See Problem 6.14. If $N_1$ and $N_2$ are large, this is approximately $\Delta^2$.
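
Formula (4) can be inverted to remove the bias of $D^2$. The sketch below is a direct algebraic rearrangement of (4), not a formula stated in the text; the function names are hypothetical and the inputs are illustrative.

```python
def expected_d2(delta2, p, n1, n2):
    """Expectation of D^2 according to (4), with n = N1 + N2 - 2."""
    n = n1 + n2 - 2
    return n / (n - p - 1) * (delta2 + p * (1.0 / n1 + 1.0 / n2))

def unbiased_delta2(d2, p, n1, n2):
    """Rearrangement of (4): a function of D^2 whose expectation is Delta^2."""
    n = n1 + n2 - 2
    return (n - p - 1) / n * d2 - p * (1.0 / n1 + 1.0 / n2)

# Illustrative values: Delta^2 = 4, p = 3, N1 = 20, N2 = 30.
mean_d2 = expected_d2(4.0, 3, 20, 30)
recovered = unbiased_delta2(mean_d2, 3, 20, 30)
print(mean_d2, recovered)
```

The first value exceeds $\Delta^2$, which shows that $D^2$ overestimates the distance on average in small samples.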
Anderson (1973b) showed the following: 


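Theorem 6.6.2. If $N_1/N_2 \to$ a positive limit as $n \to \infty$,

$$\Pr\left\{ \frac{W - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1 \right\}
= \Phi(u) - \phi(u) \left\{ \frac{1}{N_1} \left( \frac{u}{2} - \frac{p-1}{\Delta} \right) + \frac{1}{n} \left[ \frac{u^3}{4} + \left( p - \frac34 \right) u \right] \right\} + O(n^{-2}), \tag{5}$$

$$\Pr\left\{ -\frac{W + \tfrac12 D^2}{D} \le u \,\Big|\, \pi_2 \right\}
= \Phi(u) - \phi(u) \left\{ \frac{1}{N_2} \left( \frac{u}{2} - \frac{p-1}{\Delta} \right) + \frac{1}{n} \left[ \frac{u^3}{4} + \left( p - \frac34 \right) u \right] \right\} + O(n^{-2}). \tag{6}$$

Usually, one is interested in $u \le 0$ (small probabilities of error). Then the
correction term is positive; that is, the normal approximation underestimates
the probability of misclassification.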

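One may want to choose the cutoff point $c$ so that one probability of
misclassification is controlled. Let $\alpha$ be the desired $\Pr\{W < c \mid \pi_1\}$. Anderson
(1973b, 1973c) derived the following theorem:

Theorem 6.6.3. Let $u_0$ be such that $\Phi(u_0) = \alpha$, and let

$$u = u_0 - \frac{1}{N_1} \left[ \frac{p-1}{D} - \frac12 u_0 \right] + \frac{1}{n} \left[ \left( p - \frac34 \right) u_0 + \frac14 u_0^3 \right]. \tag{7}$$

Then as $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

$$\Pr\left\{ \frac{W - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1 \right\} = \alpha + O(n^{-2}). \tag{8}$$

Then $c = Du + \tfrac12 D^2$ will attain the desired probability $\alpha$ to within $O(n^{-2})$.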

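We now turn to evaluating the probabilities of misclassification after the
two samples have been drawn. Conditional on $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$, the random
variable $W$ is normally distributed with conditional mean

$$\mathcal{E}(W \mid \pi_i, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = \left[ \mu^{(i)} - \tfrac12 (\bar{x}^{(1)} + \bar{x}^{(2)}) \right]' S^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}) = \mu^{(i)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S) \tag{9}$$

when $x$ is from $\pi_i$, $i = 1, 2$, and conditional variance

$$\operatorname{Var}(W \mid \bar{x}^{(1)}, \bar{x}^{(2)}, S) = (\bar{x}^{(1)} - \bar{x}^{(2)})' S^{-1} \Sigma S^{-1} (\bar{x}^{(1)} - \bar{x}^{(2)}) = \sigma^2(\bar{x}^{(1)}, \bar{x}^{(2)}, S). \tag{10}$$

Note that these means and variance are functions of the samples with
probability limits

$$\operatorname*{plim}_{N_1,N_2\to\infty} \mu^{(i)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S) = (-1)^{i-1}\, \tfrac12 \Delta^2, \qquad
\operatorname*{plim}_{N_1,N_2\to\infty} \sigma^2(\bar{x}^{(1)}, \bar{x}^{(2)}, S) = \Delta^2. \tag{11}$$

For large $N_1$ and $N_2$ the conditional probabilities of misclassification are
close to the limiting normal probabilities (with high probability relative to
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$).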

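When $c$ is the cutoff point, probabilities of misclassification conditional on
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$ are

$$P(2 \mid 1, c, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = \Phi\left[ \frac{c - \mu^{(1)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}{\sigma(\bar{x}^{(1)}, \bar{x}^{(2)}, S)} \right], \tag{12}$$

$$P(1 \mid 2, c, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = 1 - \Phi\left[ \frac{c - \mu^{(2)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}{\sigma(\bar{x}^{(1)}, \bar{x}^{(2)}, S)} \right]. \tag{13}$$

In (12) write $c$ as $Du_1 + \tfrac12 D^2$. Then the argument of $\Phi(\cdot)$ in (12) is
$u_1 D/\sigma + (\bar{x}^{(1)} - \bar{x}^{(2)})' S^{-1} (\bar{x}^{(1)} - \mu^{(1)})/\sigma$; the first term converges in
probability to $u_1$, the second term tends to 0 as $N_1 \to \infty$, $N_2 \to \infty$, and (12)
to $\Phi(u_1)$. In (13) write $c$ as $Du_2 - \tfrac12 D^2$. Then the argument of $\Phi(\cdot)$ in (13) is
$u_2 D/\sigma + (\bar{x}^{(1)} - \bar{x}^{(2)})' S^{-1} (\bar{x}^{(2)} - \mu^{(2)})/\sigma$. The first term converges in
probability to $u_2$ and the second term to 0; (13) converges to $1 - \Phi(u_2)$.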

For given $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and $S$ the (conditional) probabilities of
misclassification (12) and (13) are functions of the parameters $\mu^{(1)}$, $\mu^{(2)}$, $\Sigma$ and
can be estimated. Consider them when $c = 0$. Then (12) and (13) converge in
probability to $\Phi(-\tfrac12\Delta)$; that suggests $\Phi(-\tfrac12 D)$ as an estimator of (12) and
(13). A better estimator is $\Phi(-\tfrac12 \tilde{D})$, where $\tilde{D}^2 = (n - p - 1)D^2/n$, which
is closer to being an unbiased estimator of $\Delta^2$. [See (4).] McLachlan
(1973, 1974a, 1974b, 1974c) gave an estimator of (12) whose bias is of order
$n^{-2}$; it is

$$\Phi(-\tfrac12 D) + \phi(\tfrac12 D) \left\{ \frac{p-1}{N_1 D} + \frac{1}{32n} \left[ -D^3 + 4(4p-1)D \right] \right\}. \tag{14}$$

[McLachlan gave (14) to terms of order $n^{-1}$.] McLachlan explored the
properties of these and other estimators, as did Lachenbruch and Mickey
(1968).

Now consider (12) with $c = Du_1 + \tfrac12 D^2$; $u_1$ might be chosen to control
$P(2 \mid 1)$ conditional on $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, $S$. This conditional probability as a function of
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, $S$ is a random variable whose distribution may be approximated.
McLachlan showed the following:


Theorem 6.6.4. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,
Р(211, Du, + 50°, х, 7, S) - (ui) 
MM €x 
Ф(из) [sur пим] 


(p - Dn/N, - (p - i * n/N)u, 01/4 
Vn [5u? пим | 


(15) "b 


-2b0x- 


+ O(n^?). 
McLachlan (1977) gave a method of selecting $u_1$ so that the probability of
one misclassification is less than a preassigned $\delta$ with a preassigned
confidence level $1 - \varepsilon$.


6.6.2. Asymptotic Expansions of the Probabilities of Misclassification 
Using Z 


We now turn our attention to $Z$ defined by (32) of Section 6.5. The results
are parallel to those for $W$. Memon and Okamoto (1971) expanded the
distribution of $Z$ to terms of order $n^{-2}$, and Siotani and Wang (1975, 1977)
to terms of order $n^{-3}$.


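Theorem 6.6.5. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2$ approaches a positive
limit,

$$\begin{aligned}
\Pr\left\{ \frac{Z - \tfrac12\Delta^2}{\Delta} \le u \,\Big|\, \pi_1 \right\}
= \Phi(u) - \phi(u) \Big\{ & \frac{1}{2N_1\Delta^2} \left[ u^3 + \Delta u^2 + (p-3)u - \Delta \right] \\
& + \frac{1}{2N_2\Delta^2} \left[ u^3 + \Delta u^2 + (p - 3 - \Delta^2)u - \Delta^3 - \Delta \right] \\
& + \frac{1}{4n} \left[ 4u^3 + 4\Delta u^2 + (6p - 6 + \Delta^2)u + 2(p-1)\Delta \right] \Big\} + O(n^{-2}),
\end{aligned} \tag{16}$$

and $\Pr\{ -(Z + \tfrac12\Delta^2)/\Delta \le u \mid \pi_2 \}$ is (16) with $N_1$ and $N_2$ interchanged.

When $c = 0$, then $u = -\tfrac12\Delta$. If $N_1 = N_2$, the rule with $Z$ is identical to the
rule with $W$, and the probability of misclassification is given by (2).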


Fujikoshi and Kanazawa (1976) proved 


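Theorem 6.6.6

$$\begin{aligned}
\Pr\left\{ \frac{Z - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1 \right\}
= \Phi(u) - \phi(u) \Big\{ & \frac{1}{2N_1\Delta} \left[ u^2 + \Delta u - (p-1) \right] \\
& - \frac{1}{2N_2\Delta} \left[ u^2 + 2\Delta u + p - 1 + \Delta^2 \right] \\
& + \frac{1}{4n} \left[ u^3 + (4p-3)u \right] \Big\} + O(n^{-2}),
\end{aligned} \tag{17}$$

$$\begin{aligned}
\Pr\left\{ -\frac{Z + \tfrac12 D^2}{D} \le u \,\Big|\, \pi_2 \right\}
= \Phi(u) - \phi(u) \Big\{ & -\frac{1}{2N_1\Delta} \left[ u^2 + 2\Delta u + p - 1 + \Delta^2 \right] \\
& + \frac{1}{2N_2\Delta} \left[ u^2 + \Delta u - (p-1) \right] \\
& + \frac{1}{4n} \left[ u^3 + (4p-3)u \right] \Big\} + O(n^{-2}).
\end{aligned} \tag{18}$$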


Kanazawa (1979) showed the following: 


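Theorem 6.6.7. Let $u_0$ be such that $\Phi(u_0) = \alpha$, and let

$$\begin{aligned}
u = u_0 &+ \frac{1}{2N_1 D} \left[ u_0^2 + Du_0 - (p-1) \right] \\
&- \frac{1}{2N_2 D} \left[ u_0^2 + 2Du_0 + (p-1) + D^2 \right] \\
&+ \frac{1}{4n} \left[ u_0^3 + (4p-3)u_0 \right].
\end{aligned} \tag{19}$$

Then as $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

$$\Pr\left\{ \frac{Z - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1 \right\} = \alpha + O(n^{-2}). \tag{20}$$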


Now consider the probabilities of misclassification after the samples have
been drawn. The conditional distribution of $Z$ is not normal; $Z$ is quadratic
in $x$ unless $N_1 = N_2$. We do not have expressions equivalent to (12) and (13).
Siotani (1980) showed the following:


Theorem 6.6.8. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,


(21) no NN, P(211,0,39, 39, $) - &(- 34) 
MEN 94) < 


NN. 1 
=ф| х 1*2 
[e-a (зар 0 ал 


1 - 
+ av; (0 - 1) +34?] - o £ O(n7). 


It is also possible to obtain a similar expression for
$P(2 \mid 1, Du_1 + \tfrac12 D^2, \bar{x}^{(1)}, \bar{x}^{(2)}, S)$ for $Z$ and a confidence interval. See Siotani (1980).


6.7. CLASSIFICATION INTO ONE OF SEVERAL POPULATIONS


Let us now consider the problem of classifying an observation into one of
several populations. We shall extend the consideration of the previous
sections to the cases of more than two populations. Let $\pi_1, \dots, \pi_m$ be $m$
populations with density functions $p_1(x), \dots, p_m(x)$, respectively. We wish to
divide the space of observations into $m$ mutually exclusive and exhaustive
regions $R_1, \dots, R_m$. If an observation falls into $R_i$, we shall say that it comes
from $\pi_i$. Let the cost of misclassifying an observation from $\pi_i$ as coming from
$\pi_j$ be $C(j \mid i)$. The probability of this misclassification is

$$P(j \mid i, R) = \int_{R_j} p_i(x)\, dx. \tag{1}$$


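Suppose we have a priori probabilities of the populations, $q_1, \dots, q_m$. Then
the expected loss is

$$\sum_{i=1}^{m} q_i \left\{ \sum_{\substack{j=1 \\ j \ne i}}^{m} C(j \mid i)\, P(j \mid i, R) \right\}. \tag{2}$$

We should like to choose $R_1, \dots, R_m$ to make this a minimum.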

Since we have a priori probabilities for the populations, we can define the
conditional probability of an observation coming from a population given
the values of the components of the vector $x$. The conditional probability of
the observation coming from $\pi_i$ is

$$\frac{q_i p_i(x)}{\sum_{k=1}^{m} q_k p_k(x)}. \tag{3}$$

If we classify the observation as from $\pi_j$, the expected loss is

$$\sum_{\substack{i=1 \\ i \ne j}}^{m} \frac{q_i p_i(x)}{\sum_{k=1}^{m} q_k p_k(x)}\, C(j \mid i). \tag{4}$$

We minimize the expected loss at this point if we choose $j$ so as to minimize
(4); that is, we consider

$$\sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x)\, C(j \mid i) \tag{5}$$

for all $j$ and select that $j$ that gives the minimum. (If two different indices
give the minimum, it is irrelevant which index is selected.) This procedure
assigns the point $x$ to one of the $R_j$. Following this procedure for each $x$, we
define our regions $R_1, \dots, R_m$. The classification procedure, then, is to
classify an observation as coming from $\pi_j$ if it falls in $R_j$.


Theorem 6.7.1. If $q_i$ is the a priori probability of drawing an observation
from population $\pi_i$ with density $p_i(x)$, $i = 1, \dots, m$, and if the cost of
misclassifying an observation from $\pi_i$ as from $\pi_j$ is $C(j \mid i)$, then the regions of
classification, $R_1, \dots, R_m$, that minimize the expected cost are defined by
assigning $x$ to $R_k$ if

$$\sum_{\substack{i=1 \\ i \ne k}}^{m} q_i p_i(x)\, C(k \mid i) < \sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x)\, C(j \mid i), \qquad j = 1, \dots, m, \quad j \ne k. \tag{6}$$

[If (6) holds for all $j$ ($j \ne k$) except for $h$ indices and the inequality is replaced by
equality for those indices, then this point can be assigned to any of the $h + 1$ $\pi$'s.]
If the probability of equality between the right-hand and left-hand sides of (6) is
zero for each $k$ and $j$ under $\pi_i$ (each $i$), then the minimizing procedure is unique
except for sets of probability zero.

Proof. We now verify this result. Let

$$h_j(x) = \sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x)\, C(j \mid i). \tag{7}$$

Then the expected loss of a procedure $R$ is

$$\sum_{j=1}^{m} \int_{R_j} h_j(x)\, dx = \int h(x \mid R)\, dx, \tag{8}$$

where $h(x \mid R) = h_j(x)$ for $x$ in $R_j$. For the Bayes procedure $R^*$ described in
the theorem, $h(x \mid R)$ is $h(x \mid R^*) = \min_i h_i(x)$. Thus the difference between
the expected loss for any procedure $R$ and for $R^*$ is

$$\int \left[ h(x \mid R) - h(x \mid R^*) \right] dx = \sum_{j} \int_{R_j} \left[ h_j(x) - \min_i h_i(x) \right] dx \ge 0. \tag{9}$$

Equality can hold only if $h_j(x) = \min_i h_i(x)$ for $x$ in $R_j$ except for sets of
probability zero. ∎

Let us see how this method applies when $C(j \mid i) = 1$ for all $i$ and $j$, $i \ne j$.
Then in $R_k$

$$\sum_{\substack{i=1 \\ i \ne k}}^{m} q_i p_i(x) < \sum_{\substack{i=1 \\ i \ne j}}^{m} q_i p_i(x), \qquad j \ne k. \tag{10}$$

Subtracting $\sum_{i \ne k, j} q_i p_i(x)$ from both sides of (10), we obtain

$$q_j p_j(x) < q_k p_k(x), \qquad j \ne k. \tag{11}$$

In this case the point $x$ is in $R_k$ if $k$ is the index for which $q_i p_i(x)$ is a
maximum; that is, $\pi_k$ is the most probable population.
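
The rule of Theorem 6.7.1 amounts to evaluating (5) for every $j$ and taking the smallest value. The sketch below assumes NumPy; the function name `bayes_assign`, the convention `cost[j, i]` for $C(j \mid i)$ (with zero diagonal), and all numerical values are illustrative assumptions. Indices start at 0 in the code.

```python
import numpy as np

def bayes_assign(densities, priors, cost):
    """Index j minimizing (5): sum over i != j of q_i p_i(x) C(j|i).

    densities[i] = p_i(x), priors[i] = q_i, cost[j, i] = C(j|i), cost[i, i] = 0.
    """
    weighted = np.asarray(priors) * np.asarray(densities)   # q_i p_i(x)
    expected = np.asarray(cost) @ weighted                  # value of (5) for each j
    return int(np.argmin(expected))

# Illustrative density values at one point x and illustrative priors.
densities = [0.2, 0.5, 0.3]
priors = [0.6, 0.2, 0.2]

# Equal costs: the rule reduces to (11), the largest q_i p_i(x).
zero_one = np.ones((3, 3)) - np.eye(3)
choice_equal = bayes_assign(densities, priors, zero_one)

# Making it costly to assign an observation from population 1 to population 0.
skewed = zero_one.copy()
skewed[0, 1] = 5.0
choice_skewed = bayes_assign(densities, priors, skewed)
print(choice_equal, choice_skewed)
```

With equal costs the choice coincides with the index of the largest $q_i p_i(x)$; raising a single cost can move the point into another region.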

Now suppose that we do not have a priori probabilities. Then we cannot
define an unconditional expected loss for a classification procedure. However,
we can define an expected loss on the condition that the observation
comes from a given population. The conditional expected loss if the
observation is from $\pi_i$ is

$$\sum_{\substack{j=1 \\ j \ne i}}^{m} C(j \mid i)\, P(j \mid i, R) = r(i, R). \tag{12}$$

A procedure $R$ is at least as good as $R^*$ if $r(i, R) \le r(i, R^*)$, $i = 1, \dots, m$; $R$
is better if at least one inequality is strict. $R$ is admissible if there is no
procedure $R^*$ that is better. A class of procedures is complete if for every
procedure $R$ outside the class there is a procedure $R^*$ in the class that is
better.


Now let us show that a Bayes procedure is admissible. Let $R$ be a Bayes
procedure; let $R^*$ be another procedure. Since $R$ is Bayes,

$$\sum_{i=1}^{m} q_i\, r(i, R) \le \sum_{i=1}^{m} q_i\, r(i, R^*). \tag{13}$$

Suppose $q_1 > 0$, $q_2 > 0$, $r(2, R^*) < r(2, R)$, and $r(i, R^*) \le r(i, R)$, $i = 3, \dots, m$.
Then

$$q_1 \left[ r(1, R) - r(1, R^*) \right] \le \sum_{i=2}^{m} q_i \left[ r(i, R^*) - r(i, R) \right] < 0, \tag{14}$$

and $r(1, R) < r(1, R^*)$. Thus $R^*$ is not better than $R$.


Theorem 6.7.2. If $q_i > 0$, $i = 1, \dots, m$, then a Bayes procedure is admissible.

We shall now assume that $C(i \mid j) = 1$, $i \ne j$, and $\Pr\{p_i(x) = 0 \mid \pi_j\} = 0$. The
latter condition implies that all $p_i(x)$ are positive on the same set (except for
a set of measure 0). Suppose $q_i = 0$ for $i = 1, \dots, t$, and $q_i > 0$ for
$i = t + 1, \dots, m$. Then for the Bayes solution $R_i$, $i = 1, \dots, t$, is empty (except for
a set of probability 0), as seen from (11) [that is, $p_m(x) = 0$ for $x$ in $R_i$].
It follows that $r(i, R) = \sum_{j \ne i} P(j \mid i, R) = 1 - P(i \mid i, R) = 1$ for $i = 1, \dots, t$.
Then $(R_{t+1}, \dots, R_m)$ is a Bayes solution for the problem involving
$p_{t+1}(x), \dots, p_m(x)$ and $q_{t+1}, \dots, q_m$. It follows from Theorem 6.7.2 that no
procedure $R^*$ for which $P(i \mid i, R^*) = 0$, $i = 1, \dots, t$, can be better than the
Bayes procedure. Now consider a procedure $R^*$ such that $R_1^*$ includes a set
of positive probability so that $P(1 \mid 1, R^*) > 0$. For $R^*$ to be better than $R$,

$$P(i \mid i, R) = \int_{R_i} p_i(x)\, dx \le P(i \mid i, R^*) = \int_{R_i^*} p_i(x)\, dx, \qquad i = 2, \dots, m. \tag{15}$$

In such a case a procedure $R^{**}$ where $R_i^{**}$ is empty, $i = 1, \dots, t$, $R_i^{**} = R_i^*$,
$i = t + 1, \dots, m - 1$, and $R_m^{**} = R_m^* \cup R_1^* \cup \cdots \cup R_t^*$ would give risks such
that

$$\begin{aligned}
P(i \mid i, R^{**}) &= 0, & i &= 1, \dots, t, \\
P(i \mid i, R^{**}) &= P(i \mid i, R^*) \ge P(i \mid i, R), & i &= t + 1, \dots, m - 1, \\
P(m \mid m, R^{**}) &> P(m \mid m, R^*) \ge P(m \mid m, R).
\end{aligned} \tag{16}$$

Then $(R_{t+1}^{**}, \dots, R_m^{**})$ would be better than $(R_{t+1}, \dots, R_m)$ for the $(m - t)$-
decision problem, which contradicts the preceding discussion.


Theorem 6.7.3. If $C(i \mid j) = 1$, $i \ne j$, and $\Pr\{p_i(x) = 0 \mid \pi_j\} = 0$, then a Bayes
procedure is admissible.


The converse is true without conditions (except that the parameter space
is finite).


Theorem 6.7.4. Every admissible procedure is a Bayes procedure.


We shall not prove this theorem. It is Theorem 1 of Section 2.10 of
Ferguson (1967), for example. The class of Bayes procedures is minimal
complete if each Bayes procedure is unique (for the specified probabilities).

The minimax procedure is the Bayes procedure for which the risks are
equal.

There are available general treatments of statistical decision procedures
by Wald (1950), Blackwell and Girshick (1954), Ferguson (1967), De Groot
(1970), Berger (1980b), and others.


6.8. CLASSIFICATION INTO ONE OF SEVERAL MULTIVARIATE 
NORMAL POPULATIONS 


We shall now apply the theory of Section 6.7 to the case in which each
population has a normal distribution. [See von Mises (1945).] We assume that
the means are different and the covariance matrices are alike. Let $N(\mu^{(i)}, \Sigma)$
be the distribution of $\pi_i$. The density is given by (1) of Section 6.4. At the
outset the parameters are assumed known. For general costs with known
a priori probabilities we can form the $m$ functions (5) of Section 6.7 and
define the region $R_j$ as consisting of points $x$ such that the $j$th function is
minimum.

In the remainder of our discussion we shall assume that the costs of
misclassification are equal. Then we use the functions

$$u_{jk}(x) = \log \frac{p_j(x)}{p_k(x)} = \left[ x - \tfrac12 (\mu^{(j)} + \mu^{(k)}) \right]' \Sigma^{-1} (\mu^{(j)} - \mu^{(k)}). \tag{1}$$

If a priori probabilities are known, the region $R_j$ is defined by those $x$
satisfying

$$R_j: \quad u_{jk}(x) > \log \frac{q_k}{q_j}, \qquad k = 1, \dots, m, \quad k \ne j. \tag{2}$$


Theorem 6.8.1. If $q_i$ is the a priori probability of drawing an observation
from $\pi_i = N(\mu^{(i)}, \Sigma)$, $i = 1, \dots, m$, and if the costs of misclassification are equal,
then the regions of classification, $R_1, \dots, R_m$, that minimize the expected cost are
defined by (2), where $u_{jk}(x)$ is given by (1).


It should be noted that each $u_{jk}(x)$ is the classification function related to
the $j$th and $k$th populations, and $u_{jk}(x) = -u_{kj}(x)$. Since these are linear
functions, the region $R_j$ is bounded by hyperplanes. If the means span an
$(m - 1)$-dimensional hyperplane (for example, if the vectors $\mu^{(i)}$ are linearly
independent and $p \ge m - 1$), then $R_j$ is bounded by $m - 1$ hyperplanes.
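
The regions (2) can be sketched in code when the parameters are known. The following assumes NumPy; the function names, the means, the covariance matrix, and the priors are illustrative assumptions. It evaluates $u_{jk}(x)$ from (1) and returns the index $j$ (starting at 0) whose inequalities in (2) all hold.

```python
import numpy as np

def u(x, mu_j, mu_k, sigma):
    """Linear classification function u_jk(x) of (1)."""
    return (x - 0.5 * (mu_j + mu_k)) @ np.linalg.solve(sigma, mu_j - mu_k)

def classify(x, mus, sigma, priors):
    """Index j such that u_jk(x) >= log(q_k / q_j) for every k != j."""
    m = len(mus)
    for j in range(m):
        if all(u(x, mus[j], mus[k], sigma) >= np.log(priors[k] / priors[j])
               for k in range(m) if k != j):
            return j
    return None   # only reachable through rounding at a boundary

# Illustrative parameters: three populations in the plane, identity covariance.
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
sigma = np.eye(2)
equal = [1 / 3, 1 / 3, 1 / 3]
unequal = [0.9, 0.05, 0.05]

x = np.array([2.2, 0.0])
label_equal = classify(x, mus, sigma, equal)
label_unequal = classify(x, mus, sigma, unequal)
print(label_equal, label_unequal)
```

With equal priors and an identity covariance matrix the rule picks the nearest mean; a large prior for one population enlarges its region, so the same point can change regions.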

In the case of no set of a priori probabilities known, the region $R_j$ is
defined by inequalities

$$R_j: \quad u_{jk}(x) \ge c_j - c_k, \qquad k = 1, \dots, m, \quad k \ne j. \tag{3}$$

The constants $c_k$ can be taken nonnegative. These sets of regions form the
class of admissible procedures. For the minimax procedure these constants
are determined so all $P(i \mid i, R)$ are equal.

We now show how to evaluate the probabilities of correct classification. If
$X$ is a random observation, we consider the random variables

$$U_{ji} = \left[ X - \tfrac12 (\mu^{(j)} + \mu^{(i)}) \right]' \Sigma^{-1} (\mu^{(j)} - \mu^{(i)}). \tag{4}$$

Here $U_{ji} = -U_{ij}$. Thus we use $m(m - 1)/2$ classification functions if the
means span an $(m - 1)$-dimensional hyperplane. If $X$ is from $\pi_j$, then $U_{ji}$ is
distributed according to $N(\tfrac12 \Delta_{ji}^2, \Delta_{ji}^2)$, where

$$\Delta_{ji}^2 = (\mu^{(j)} - \mu^{(i)})' \Sigma^{-1} (\mu^{(j)} - \mu^{(i)}). \tag{5}$$

The covariance of $U_{ji}$ and $U_{jk}$ is

$$\Delta_{ji,jk} = (\mu^{(j)} - \mu^{(i)})' \Sigma^{-1} (\mu^{(j)} - \mu^{(k)}). \tag{6}$$

To determine the constants $c_j$ we consider the integrals

$$P(j \mid j, R) = \int_{c_j - c_m}^{\infty} \cdots \int_{c_j - c_1}^{\infty} f_j \; du_{j1} \cdots du_{j,j-1}\, du_{j,j+1} \cdots du_{jm}, \tag{7}$$

where $f_j$ is the density of $U_{ji}$, $i = 1, 2, \dots, m$, $i \ne j$.


Theorem 6.8.2. If $\pi_i$ is $N(\mu^{(i)}, \Sigma)$ and the costs of misclassification are
equal, then the regions of classification, $R_1, \dots, R_m$, that minimize the maximum
conditional expected loss are defined by (3), where $u_{jk}(x)$ is given by (1). The
constants $c_j$ are determined so that the integrals (7) are equal.


As an example consider the case of $m = 3$. There is no loss of generality in
taking $p = 2$, for the density for higher $p$ can be projected on the
two-dimensional plane determined by the means of the three populations if they are not
collinear (i.e., we can transform the vector $x$ into $u_{12}$, $u_{13}$, and $p - 2$ other
coordinates, where these last $p - 2$ components are distributed independently
of $u_{12}$ and $u_{13}$ and with zero means). The regions $R_i$ are determined
by three half lines as shown in Figure 6.2. If this procedure is minimax, we
cannot move the line between $R_1$ and $R_2$ nearer $(\mu_1^{(1)}, \mu_2^{(1)})$, the line between
$R_2$ and $R_3$ nearer $(\mu_1^{(2)}, \mu_2^{(2)})$, and the line between $R_3$ and $R_1$ nearer
$(\mu_1^{(3)}, \mu_2^{(3)})$ and still retain the equality $P(1 \mid 1, R) = P(2 \mid 2, R) = P(3 \mid 3, R)$
without leaving a triangle that is not included in any region. Thus, since the
regions must exhaust the space, the lines must meet in a point, and the
equality of probabilities determines $c_i - c_j$ uniquely.

Figure 6.2. Classification regions.


To do this in a specific case in which we have numerical values for the
components of the vectors $\mu^{(1)}$, $\mu^{(2)}$, $\mu^{(3)}$, and the matrix $\Sigma$, we would
consider the three ($\le p + 1$) joint distributions, each of two $U_{ij}$'s ($j \ne i$). We
could try the values of $c_i = 0$ and, using tables [Pearson (1931)] of the
bivariate normal distribution, compute $P(i \mid i, R)$. By a trial-and-error method
we could obtain $c_i$ to approximate the above condition.

The preceding theory has been given on the assumption that the parameters
are known. If they are not known and if a sample from each population
is available, the estimators of the parameters can be substituted in the
definition of $u_{ij}(x)$. Let the observations be $x_1^{(i)}, \dots, x_{N_i}^{(i)}$ from $N(\mu^{(i)}, \Sigma)$,
$i = 1, \dots, m$. We estimate $\mu^{(i)}$ by

$$\bar{x}^{(i)} = \frac{1}{N_i} \sum_{\alpha=1}^{N_i} x_\alpha^{(i)} \tag{8}$$

and $\Sigma$ by $S$ defined by

$$\left( \sum_{i=1}^{m} N_i - m \right) S = \sum_{i=1}^{m} \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'. \tag{9}$$

Then, the analog of $u_{ij}(x)$ is

$$w_{ij}(x) = \left[ x - \tfrac12 (\bar{x}^{(i)} + \bar{x}^{(j)}) \right]' S^{-1} (\bar{x}^{(i)} - \bar{x}^{(j)}). \tag{10}$$

If the variables above are random, the distributions are different from those
of $U_{ij}$. However, as $N_i \to \infty$, the joint distributions approach those of $U_{ij}$.
Hence, for sufficiently large samples one can use the theory given above.


Table 6.2. Means

Measurement            Brahmin ($\pi_1$)   Artisan ($\pi_2$)   Korwa ($\pi_3$)
Stature ($x_1$)              164.51            160.53            158.17
Sitting height ($x_2$)        86.43             81.47             81.16
Nasal depth ($x_3$)           25.49             23.84             21.44
Nasal height ($x_4$)          51.24             48.62             46.72

6.9. AN EXAMPLE OF CLASSIFICATION INTO ONE OF SEVERAL 
MULTIVARIATE NORMAL POPULATIONS 


Rao (1948a) considers three populations consisting of the Brahmin caste
($\pi_1$), the Artisan caste ($\pi_2$), and the Korwa caste ($\pi_3$) of India. The
measurements for each individual of a caste are stature ($x_1$), sitting height
($x_2$), nasal depth ($x_3$), and nasal height ($x_4$). The means of these variables in
the three populations are given in Table 6.2. The matrix of correlations for
all the populations is

$$\begin{pmatrix}
1.0000 & 0.5849 & 0.1774 & 0.1974 \\
0.5849 & 1.0000 & 0.2094 & 0.2170 \\
0.1774 & 0.2094 & 1.0000 & 0.2910 \\
0.1974 & 0.2170 & 0.2910 & 1.0000
\end{pmatrix}. \tag{1}$$

The standard deviations are $\sigma_1 = 5.74$, $\sigma_2 = 3.20$, $\sigma_3 = 1.75$, $\sigma_4 = 3.50$. We
assume that each population is normal. Our problem is to divide the space of
the four variables $x_1, x_2, x_3, x_4$ into three regions of classification. We
assume that the costs of misclassification are equal. We shall find (i) a set of
regions under the assumption that drawing a new observation from each
population is equally likely ($q_1 = q_2 = q_3 = \tfrac13$), and (ii) a set of regions such
that the largest probability of misclassification is minimized (the minimax
solution).


We first compute the coefficients of $\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ and $\Sigma^{-1}(\mu^{(1)} - \mu^{(3)})$.
Then $\Sigma^{-1}(\mu^{(2)} - \mu^{(3)}) = \Sigma^{-1}(\mu^{(1)} - \mu^{(3)}) - \Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. Then we
calculate $\tfrac12 (\mu^{(i)} + \mu^{(j)})' \Sigma^{-1} (\mu^{(i)} - \mu^{(j)})$. We obtain the discriminant functions†

$$\begin{aligned}
u_{12}(x) &= -0.0708x_1 + 0.4990x_2 + 0.3373x_3 + 0.0887x_4 - 43.13, \\
u_{13}(x) &= \phantom{-}0.0003x_1 + 0.3550x_2 + 1.1063x_3 + 0.1375x_4 - 62.49, \\
u_{23}(x) &= \phantom{-}0.0711x_1 - 0.1440x_2 + 0.7690x_3 + 0.0488x_4 - 19.36.
\end{aligned} \tag{2}$$

†Due to an error in computations, Rao's discriminant functions are incorrect. I am indebted to
Mr. Peter Frank for assistance in the computations.

Table 6.3

Population of x    u            Mean    Standard deviation    Correlation
$\pi_1$            $u_{12}$     1.491        1.727              0.8658
                   $u_{13}$     3.487        2.641
$\pi_2$            $u_{21}$     1.491        1.727             -0.3894
                   $u_{23}$     1.031        1.436
$\pi_3$            $u_{31}$     3.487        2.641              0.7983
                   $u_{32}$     1.031        1.436

The other three functions are $u_{21}(x) = -u_{12}(x)$, $u_{31}(x) = -u_{13}(x)$, and
$u_{32}(x) = -u_{23}(x)$. If there are a priori probabilities and they are equal, the
best set of regions of classification are $R_1$: $u_{12}(x) \ge 0$, $u_{13}(x) \ge 0$; $R_2$:
$u_{21}(x) \ge 0$, $u_{23}(x) \ge 0$; and $R_3$: $u_{31}(x) \ge 0$, $u_{32}(x) \ge 0$. For example, if we
obtain an individual with measurements $x$ such that $u_{12}(x) \ge 0$ and $u_{13}(x) \ge 0$,
we classify him as a Brahmin.
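
The equal-probability rule can be applied mechanically. The sketch below uses only the Python standard library and takes the coefficients of (2) and the means of Table 6.2 as given; the function names are hypothetical. It evaluates the three discriminant functions and assigns a measurement vector to the region whose two inequalities both hold.

```python
# Coefficients and constants of the discriminant functions (2).
COEFFS = {
    (1, 2): ([-0.0708, 0.4990, 0.3373, 0.0887], -43.13),
    (1, 3): ([0.0003, 0.3550, 1.1063, 0.1375], -62.49),
    (2, 3): ([0.0711, -0.1440, 0.7690, 0.0488], -19.36),
}

def u(j, k, x):
    """u_jk(x); the remaining functions follow from u_kj(x) = -u_jk(x)."""
    if (j, k) in COEFFS:
        a, const = COEFFS[(j, k)]
        return sum(ai * xi for ai, xi in zip(a, x)) + const
    return -u(k, j, x)

def classify(x):
    """Population index (1, 2, or 3) under equal a priori probabilities."""
    for j in (1, 2, 3):
        if all(u(j, k, x) >= 0 for k in (1, 2, 3) if k != j):
            return j
    return None

# Mean vectors from Table 6.2 (stature, sitting height, nasal depth, nasal height).
brahmin = [164.51, 86.43, 25.49, 51.24]
artisan = [160.53, 81.47, 23.84, 48.62]
korwa = [158.17, 81.16, 21.44, 46.72]

print(classify(brahmin), classify(artisan), classify(korwa))
print(u(1, 2, brahmin), u(1, 3, brahmin))
```

Evaluated at the Brahmin mean vector, $u_{12}$ and $u_{13}$ come out close to the means 1.491 and 3.487 listed for $\pi_1$ in Table 6.3, up to rounding of the published coefficients.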

To find the probabilities of misclassification when an individual is drawn
from population $\pi_i$ we need the means, variances, and covariances of the
proper pairs of $u$'s. They are given in Table 6.3.‡

The probabilities of misclassification are then obtained by use of the
tables for the bivariate normal distribution. These probabilities are 0.21 for
$\pi_1$, 0.42 for $\pi_2$, and 0.25 for $\pi_3$. For example, if measurements are made on
a Brahmin, the probability that he is classified as an Artisan or Korwa is 0.21.

The minimax solution is obtained by finding the constants $c_1$, $c_2$, and $c_3$
for (3) of Section 6.8 so that the probabilities of misclassification are equal.
The regions of classification are

$$\begin{aligned}
R_1&: \quad u_{12}(x) \ge \phantom{-}0.54, \quad u_{13}(x) \ge \phantom{-}0.29; \\
R_2&: \quad u_{21}(x) \ge -0.54, \quad u_{23}(x) \ge -0.25; \\
R_3&: \quad u_{31}(x) \ge -0.29, \quad u_{32}(x) \ge \phantom{-}0.25.
\end{aligned} \tag{3}$$

The common probability of misclassification (to two decimal places) is 0.30.
Thus the maximum probability of misclassification has been reduced from
0.42 to 0.30.

‡Some numerical errors in Anderson (1951a) are corrected in Table 6.3 and (3).


6.10. CLASSIFICATION INTO ONE OF TWO KNOWN MULTIVARIATE
NORMAL POPULATIONS WITH UNEQUAL COVARIANCE MATRICES

6.10.1. Likelihood Procedures


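Let $\pi_1$ and $\pi_2$ be $N(\mu^{(1)}, \Sigma_1)$ and $N(\mu^{(2)}, \Sigma_2)$ with $\mu^{(1)} \ne \mu^{(2)}$ and $\Sigma_1 \ne \Sigma_2$.
When the parameters are known, the likelihood ratio is

$$\begin{aligned}
\frac{p_1(x)}{p_2(x)} &= \frac{|\Sigma_2|^{1/2} \exp\left[ -\tfrac12 (x - \mu^{(1)})' \Sigma_1^{-1} (x - \mu^{(1)}) \right]}{|\Sigma_1|^{1/2} \exp\left[ -\tfrac12 (x - \mu^{(2)})' \Sigma_2^{-1} (x - \mu^{(2)}) \right]} \\
&= |\Sigma_2|^{1/2} |\Sigma_1|^{-1/2} \exp\left[ \tfrac12 (x - \mu^{(2)})' \Sigma_2^{-1} (x - \mu^{(2)}) - \tfrac12 (x - \mu^{(1)})' \Sigma_1^{-1} (x - \mu^{(1)}) \right].
\end{aligned} \tag{1}$$

The logarithm of (1) is quadratic in $x$. The probabilities of misclassification
are difficult to compute. [One can make a linear transformation of $x$ so that
its covariance matrix is $I$ and the matrix of the quadratic form is diagonal;
then the logarithm of (1) has the distribution of a linear combination of
noncentral $\chi^2$-variables plus a constant.]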
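
The logarithm of (1) is straightforward to evaluate when the parameters are known. The sketch below assumes NumPy; the function name and all numerical values are illustrative assumptions. It also checks two special cases: with equal covariance matrices the quadratic terms cancel and the function is linear in $x$, and in a univariate case with equal means the region where the ratio favors the population with the smaller variance is an interval around the common mean.

```python
import numpy as np

def log_likelihood_ratio(x, mu1, sigma1, mu2, sigma2):
    """log p1(x) / p2(x) for two normal densities with known parameters."""
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    q1 = (x - mu1) @ np.linalg.solve(sigma1, x - mu1)
    q2 = (x - mu2) @ np.linalg.solve(sigma2, x - mu2)
    return 0.5 * (logdet2 - logdet1) + 0.5 * (q2 - q1)

# Equal covariance matrices: the result agrees with the linear function
# [x - (mu1 + mu2)/2]' Sigma^{-1} (mu1 - mu2).
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
x = np.array([0.3, 2.0])
quadratic_value = log_likelihood_ratio(x, mu1, np.eye(2), mu2, np.eye(2))
linear_value = (x - 0.5 * (mu1 + mu2)) @ (mu1 - mu2)

# Univariate case, equal means, variances 1 and 4 (illustrative).
zero = np.array([0.0])
near = log_likelihood_ratio(np.array([0.0]), zero, np.array([[1.0]]), zero, np.array([[4.0]]))
far = log_likelihood_ratio(np.array([3.0]), zero, np.array([[1.0]]), zero, np.array([[4.0]]))
print(quadratic_value, linear_value, near, far)
```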
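When the parameters are unknown, we consider the problem as testing
the hypothesis that $x, x_1^{(1)}, \dots, x_{N_1}^{(1)}$ are observations from $N(\mu^{(1)}, \Sigma_1)$ and
$x_1^{(2)}, \dots, x_{N_2}^{(2)}$ are observations from $N(\mu^{(2)}, \Sigma_2)$ against the alternative that
$x_1^{(1)}, \dots, x_{N_1}^{(1)}$ are observations from $N(\mu^{(1)}, \Sigma_1)$ and $x, x_1^{(2)}, \dots, x_{N_2}^{(2)}$ are
observations from $N(\mu^{(2)}, \Sigma_2)$. Under the first hypothesis the maximum likelihood
estimators are $\hat\mu_1^{(1)} = (N_1\bar{x}^{(1)} + x)/(N_1 + 1)$, $\hat\mu_1^{(2)} = \bar{x}^{(2)}$,

$$\begin{aligned}
\hat\Sigma_1(1) &= \frac{1}{N_1+1} \left[ A_1 + \frac{N_1}{N_1+1} (x - \bar{x}^{(1)})(x - \bar{x}^{(1)})' \right], \\
\hat\Sigma_2(1) &= \frac{1}{N_2} A_2,
\end{aligned} \tag{2}$$

where $A_i = \sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'$, $i = 1, 2$. (See Section 6.5.5.) Under
the second hypothesis the maximum likelihood estimators are $\hat\mu_2^{(1)} = \bar{x}^{(1)}$,
$\hat\mu_2^{(2)} = (N_2\bar{x}^{(2)} + x)/(N_2 + 1)$,

$$\begin{aligned}
\hat\Sigma_1(2) &= \frac{1}{N_1} A_1, \\
\hat\Sigma_2(2) &= \frac{1}{N_2+1} \left[ A_2 + \frac{N_2}{N_2+1} (x - \bar{x}^{(2)})(x - \bar{x}^{(2)})' \right].
\end{aligned} \tag{3}$$

The likelihood ratio criterion is

$$\begin{aligned}
\frac{|\hat\Sigma_1(2)|^{\frac12 N_1}\, |\hat\Sigma_2(2)|^{\frac12 (N_2+1)}}{|\hat\Sigma_1(1)|^{\frac12 (N_1+1)}\, |\hat\Sigma_2(1)|^{\frac12 N_2}}
&= \frac{\left[ 1 + \dfrac{N_2}{N_2+1} (x - \bar{x}^{(2)})' A_2^{-1} (x - \bar{x}^{(2)}) \right]^{\frac12 (N_2+1)}}{\left[ 1 + \dfrac{N_1}{N_1+1} (x - \bar{x}^{(1)})' A_1^{-1} (x - \bar{x}^{(1)}) \right]^{\frac12 (N_1+1)}} \\
&\quad \times \frac{(N_1+1)^{\frac12 p(N_1+1)}\, N_2^{\frac12 pN_2}\, |A_2|^{\frac12}}{N_1^{\frac12 pN_1}\, (N_2+1)^{\frac12 p(N_2+1)}\, |A_1|^{\frac12}}.
\end{aligned} \tag{4}$$

The observation $x$ is classified into $\pi_1$ if (4) is greater than 1 and into $\pi_2$ if
(4) is less than 1.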


An alternative criterion is to plug estimates into the logarithm of (1). Use

$$\tfrac12 \left[ (x - \bar{x}^{(2)})' S_2^{-1} (x - \bar{x}^{(2)}) - (x - \bar{x}^{(1)})' S_1^{-1} (x - \bar{x}^{(1)}) \right] \tag{5}$$

to classify into $\pi_1$ if (5) is large and into $\pi_2$ if (5) is small. Again it is difficult
to evaluate the probabilities of misclassification.


6.10.2. Linear Procedures 


The best procedures when $\Sigma_1 \ne \Sigma_2$ are not linear; when the parameters are
known, the best procedures are based on a quadratic function of the vector
observation $x$. The procedure depends very much on the assumed normality.
For example, in the case of $p = 1$, the region for classification with one
population is an interval and for the other is the complement of the interval,
that is, where the observation is sufficiently large or sufficiently small. In
the bivariate case the regions are defined by conic sections; for example, the
region of classification into one population might be the interior of an ellipse
or the region between two hyperbolas. In general, the regions are defined by 
means of a quadratic function of the observations which is not necessarily a 
positive definite quadratic form. These procedures depend very much on the 
assumption of normality and especially on the shape of the normal distribu- 
tion far from its center. For instance, in the univariate case cited above the 
region of classification into the first population is a finite interval because the 
density of the first population falls off in either direction more rapidly than 
the density of the second because its standard deviation is smaller. 

One may want to use a classification procedure in a situation where the 
two populations are centered around different points and have different 
patterns of scatter, and where one considers multivariate normal distribu- 
tions to be reasonably good approximations for these two populations near 
their centers and between their two centers (though not far from the centers, 
where the densities are small). In such a case one may want to divide the
sample space into the two regions of classification by some simple curve or
surface. The simplest is a line or hyperplane; the procedure may then be 
termed linear. 

Let $b$ ($\ne 0$) be a vector (of $p$ components) and $c$ a scalar. An observation
$x$ is classified as from the first population if $b'x \ge c$ and as from the second if
$b'x < c$. We are primarily interested in situations where the important
difference between the two populations is the difference between the centers;
we assume $\mu^{(1)} \ne \mu^{(2)}$ as well as $\Sigma_1 \ne \Sigma_2$, and that $\Sigma_1$ and $\Sigma_2$ are
nonsingular.

When sampling from the $i$th population, $b'x$ has a univariate normal
distribution with mean $\mathcal{E}(b'x \mid \pi_i) = b'\mu^{(i)}$ and variance

$$\mathcal{E}\left[ (b'x - b'\mu^{(i)})^2 \mid \pi_i \right] = \mathcal{E}\left[ b'(x - \mu^{(i)})(x - \mu^{(i)})'b \mid \pi_i \right] = b'\Sigma_i b. \tag{6}$$


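The probability of misclassifying an observation when it comes from the first 
population is 

$$\begin{aligned} P(2 \mid 1) = \Pr\{b'x < c \mid \pi_1\} &= \Pr\left\{ \frac{b'x - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} < \frac{c - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} \,\Big|\, \pi_1 \right\} \\ &= \Phi\left( \frac{c - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} \right) = 1 - \Phi\left( \frac{b'\mu^{(1)} - c}{(b'\Sigma_1 b)^{1/2}} \right). \end{aligned} \tag{7}$$

The probability of misclassifying an observation when it comes from the 
second population is 

$$\begin{aligned} P(1 \mid 2) = \Pr\{b'x \ge c \mid \pi_2\} &= \Pr\left\{ \frac{b'x - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \ge \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \,\Big|\, \pi_2 \right\} \\ &= 1 - \Phi\left( \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \right). \end{aligned} \tag{8}$$

It is desired to make these probabilities small or, equivalently, to make the 
arguments 

$$y_1 = \frac{b'\mu^{(1)} - c}{(b'\Sigma_1 b)^{1/2}}, \qquad y_2 = \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \tag{9}$$

large. We shall consider making $y_1$ large for given $y_2$. 
When we eliminate $c$ from (9), we obtain 

$$y_1 = \frac{b'\gamma - y_2 (b'\Sigma_2 b)^{1/2}}{(b'\Sigma_1 b)^{1/2}}, \tag{10}$$

where $\gamma = \mu^{(1)} - \mu^{(2)}$. To maximize $y_1$ for given $y_2$ we differentiate $y_1$ with 
respect to $b$ to obtain 

$$\frac{\partial y_1}{\partial b} = \left[\gamma - y_2 (b'\Sigma_2 b)^{-1/2} \Sigma_2 b\right] (b'\Sigma_1 b)^{-1/2} - \left[b'\gamma - y_2 (b'\Sigma_2 b)^{1/2}\right] (b'\Sigma_1 b)^{-3/2} \Sigma_1 b. \tag{11}$$

If we let 

$$t_1 = \frac{b'\gamma - y_2 (b'\Sigma_2 b)^{1/2}}{b'\Sigma_1 b}, \tag{12}$$

$$t_2 = \frac{y_2}{(b'\Sigma_2 b)^{1/2}}, \tag{13}$$

then (11) set equal to 0 is 

$$(t_1 \Sigma_1 + t_2 \Sigma_2) b = \gamma. \tag{14}$$

Note that (13) and (14) imply (12). If there is a pair $t_1, t_2$ and a vector $b$ 
satisfying (12) and (13), then $c$ is obtained from (9) as 

$$c = y_2 (b'\Sigma_2 b)^{1/2} + b'\mu^{(2)} = t_2 b'\Sigma_2 b + b'\mu^{(2)}. \tag{15}$$

Then from (9), (12), and (13) 

$$y_1 = \frac{b'\mu^{(1)} - (t_2 b'\Sigma_2 b + b'\mu^{(2)})}{(b'\Sigma_1 b)^{1/2}} = t_1 (b'\Sigma_1 b)^{1/2}. \tag{16}$$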


Now consider (14) as a function of $t$ ($0 \le t \le 1$). Let $t_1 = t$ and $t_2 = 1 - t$; 
then $b = (t_1\Sigma_1 + t_2\Sigma_2)^{-1}\gamma$. Define $v_1 = t_1 (b'\Sigma_1 b)^{1/2}$ and 
$v_2 = t_2 (b'\Sigma_2 b)^{1/2}$. Writing $M = t\Sigma_1 + (1 - t)\Sigma_2$, the derivative of $v_1^2$ 
with respect to $t$ is 

$$\begin{aligned} \frac{d}{dt}\, t^2 \gamma' M^{-1} \Sigma_1 M^{-1} \gamma &= 2t\, \gamma' M^{-1} \Sigma_1 M^{-1} \gamma - t^2 \gamma' M^{-1} (\Sigma_1 - \Sigma_2) M^{-1} \Sigma_1 M^{-1} \gamma \\ &\quad - t^2 \gamma' M^{-1} \Sigma_1 M^{-1} (\Sigma_1 - \Sigma_2) M^{-1} \gamma \\ &= t\, \gamma' M^{-1} \left\{ \Sigma_2 M^{-1} \Sigma_1 + \Sigma_1 M^{-1} \Sigma_2 \right\} M^{-1} \gamma, \end{aligned} \tag{17}$$

which is positive for $t > 0$ by the following lemma. 

Lemma 6.10.1. If $\Sigma_1$ and $\Sigma_2$ are positive definite and $t_1 > 0$, $t_2 > 0$, then 

$$\Sigma_1 [t_1\Sigma_1 + t_2\Sigma_2]^{-1} \Sigma_2 \tag{18}$$

is positive definite. 

Proof. The matrix (18) is 

$$\Sigma_1 \left[\Sigma_2 (t_1 \Sigma_2^{-1} + t_2 \Sigma_1^{-1}) \Sigma_1\right]^{-1} \Sigma_2 = (t_1 \Sigma_2^{-1} + t_2 \Sigma_1^{-1})^{-1}. \qquad \blacksquare \tag{19}$$

Similarly $dv_2^2/dt < 0$. Since $v_1 \ge 0$, $v_2 \ge 0$, we see that $v_1$ increases with $t$ 
from 0 at $t = 0$ to $(\gamma'\Sigma_1^{-1}\gamma)^{1/2}$ at $t = 1$ and $v_2$ decreases from 
$(\gamma'\Sigma_2^{-1}\gamma)^{1/2}$ at $t = 0$ to 0 at $t = 1$. The coordinates $v_1$ and $v_2$ are 
continuous functions of $t$. For given $y_2$, $0 \le y_2 \le (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$, there is a $t$ 
such that $y_2 = v_2 = t_2 (b'\Sigma_2 b)^{1/2}$ and $b$ satisfies (14) for $t_1 = t$ and 
$t_2 = 1 - t$. Then $y_1 = v_1 = t_1 (b'\Sigma_1 b)^{1/2}$ maximizes $y_1$ for that value of $y_2$. 
Similarly, given $y_1$, $0 \le y_1 \le (\gamma'\Sigma_1^{-1}\gamma)^{1/2}$, there is a $t$ such that 
$y_1 = v_1 = t_1 (b'\Sigma_1 b)^{1/2}$ and $b$ satisfies (14) for $t_1 = t$ and $t_2 = 1 - t$, 
and $y_2 = v_2 = t_2 (b'\Sigma_2 b)^{1/2}$ maximizes $y_2$. Note that $y_1 \ge 0$, $y_2 \ge 0$ implies 
the errors of misclassification are not greater than $\frac{1}{2}$. 

We now argue that the set of $y_1, y_2$ defined this way corresponds to 
admissible linear procedures. Let $x_1, x_2$ be in this set, and suppose another 
procedure defined by $z_1, z_2$ were better than $x_1, x_2$, that is, $x_1 \le z_1$, $x_2 \le z_2$ 
with at least one strict inequality. For $y_1 = z_1$ let $y_2^*$ be the maximum $y_2$ 
among linear procedures; then $z_1 = y_1$, $z_2 \le y_2^*$ and hence $x_1 \le y_1$, $x_2 \le y_2^*$. 
However, this is possible only if $x_1 = y_1$, $x_2 = y_2^*$, because $dy_1/dy_2 < 0$. Now 
we have a contradiction to the assumption that $z_1, z_2$ was better than $x_1, x_2$. 
Thus $x_1, x_2$ corresponds to an admissible linear procedure. 
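
The construction in (14) to (16) can be traced numerically. The following is a minimal sketch, assuming $p = 2$ and hypothetical parameter values (the names `admissible_point`, `MU1`, `MU2`, `S1`, `S2` are illustrative and not from the text). It solves (14) for each $t$ on a grid and reports the pair $(y_1, y_2)$, so that one can see $y_1$ rise and $y_2$ fall as $t$ goes from 0 to 1.

```python
import math

def solve2(M, g):
    """Solve the 2x2 system M b = g by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * g[0] - M[0][1] * g[1]) / det,
            (M[0][0] * g[1] - M[1][0] * g[0]) / det]

def quad(b, S):
    """Quadratic form b' S b."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def admissible_point(t, S1, S2, mu1, mu2):
    """Return (b, c, y1, y2) for t1 = t and t2 = 1 - t."""
    gamma = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    M = [[t * S1[i][j] + (1 - t) * S2[i][j] for j in range(2)] for i in range(2)]
    b = solve2(M, gamma)                                        # equation (14)
    c = (1 - t) * quad(b, S2) + b[0] * mu2[0] + b[1] * mu2[1]   # equation (15)
    y1 = t * math.sqrt(quad(b, S1))                             # equation (16)
    y2 = (1 - t) * math.sqrt(quad(b, S2))                       # equation (13) rearranged
    return b, c, y1, y2

# Hypothetical illustrative parameters (not from the text).
MU1, MU2 = [1.0, 1.0], [0.0, 0.0]
S1 = [[1.0, 0.3], [0.3, 1.0]]
S2 = [[2.0, -0.5], [-0.5, 1.5]]

curve = [admissible_point(k / 10, S1, S2, MU1, MU2) for k in range(11)]
for k, (b, c, y1, y2) in enumerate(curve):
    print(f"t={k / 10:.1f}  y1={y1:.4f}  y2={y2:.4f}  c={c:.4f}")
```

Each printed row is one admissible linear procedure under these assumptions; the corresponding error probabilities are $1 - \Phi(y_1)$ and $1 - \Phi(y_2)$.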


Use of Admissible Linear Procedures 

Given $t_1$ and $t_2$ such that $t_1\Sigma_1 + t_2\Sigma_2$ is positive definite, one would 
compute the optimum $b$ by solving the linear equations (14) and then 
compute $c$ by one of (9). Usually $t_1$ and $t_2$ are not given, but a desired 
solution is specified in another way. We consider three ways. 


Minimization of One Probability of Misclassification for a Specified Probability of the Other 

Suppose we are given $y_2$ (or, equivalently, the probability of misclassification 
when sampling from the second distribution) and we want to maximize $y_1$ 
(or, equivalently, minimize the probability of misclassification when sampling 
from the first distribution). Suppose $y_2 > 0$ (i.e., the given probability of 
misclassification is less than $\frac{1}{2}$). Then if the maximum $y_1 \ge 0$, we want to find 
$t_2 = 1 - t_1$ such that $y_2 = t_2 (b'\Sigma_2 b)^{1/2}$, where $b = [t_1\Sigma_1 + t_2\Sigma_2]^{-1}\gamma$. The 
solution can be approximated by trial and error, since $y_2$ is an increasing function 
of $t_2$. For $t_2 = 0$, $y_2 = 0$; and for $t_2 = 1$, $y_2 = (b'\Sigma_2 b)^{1/2} = (b'\gamma)^{1/2} = (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$, 
where $\Sigma_2 b = \gamma$. One could try other values of $t_2$ successively by solving (14) 
and inserting in $b'\Sigma_2 b$ until $t_2 (b'\Sigma_2 b)^{1/2}$ agreed closely enough with the 
desired $y_2$. [$y_1 > 0$ if the specified $y_2 < (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$.] 
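
The trial-and-error search described here can be organized as a bisection, since $y_2$ increases with $t_2$. What follows is only a sketch under stated assumptions: $p = 2$, hypothetical covariance matrices and mean difference, and helper names (`y2_of_t2`, `find_t2`) that are not from the text.

```python
import math

def solve2(M, g):
    """Solve the 2x2 system M b = g by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * g[0] - M[0][1] * g[1]) / det,
            (M[0][0] * g[1] - M[1][0] * g[0]) / det]

def quad(b, S):
    """Quadratic form b' S b."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def y2_of_t2(t2, S1, S2, gamma):
    """Solve (14) with t1 = 1 - t2 and return y2 = t2 (b' S2 b)^(1/2)."""
    M = [[(1 - t2) * S1[i][j] + t2 * S2[i][j] for j in range(2)] for i in range(2)]
    b = solve2(M, gamma)
    return t2 * math.sqrt(quad(b, S2))

def find_t2(target_y2, S1, S2, gamma, tol=1e-12):
    """Bisection on t2 in [0, 1], using the fact that y2 increases with t2."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if y2_of_t2(mid, S1, S2, gamma) < target_y2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical illustrative parameters (not from the text).
GAMMA = [1.0, 1.0]
S1 = [[1.0, 0.3], [0.3, 1.0]]
S2 = [[2.0, -0.5], [-0.5, 1.5]]

y2_max = y2_of_t2(1.0, S1, S2, GAMMA)   # equals (gamma' S2^{-1} gamma)^(1/2)
t2 = find_t2(0.8, S1, S2, GAMMA)        # the target y2 = 0.8 must lie below y2_max
print(t2, y2_of_t2(t2, S1, S2, GAMMA), y2_max)
```

Bisection is a substitute for the manual trial and error of the text; it relies on the same monotonicity and requires the specified $y_2$ to be smaller than $(\gamma'\Sigma_2^{-1}\gamma)^{1/2}$.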


The Minimax Procedure 
The minimax procedure is the admissible procedure for which $y_1 = y_2$. Since 
for this procedure both probabilities of correct classification are greater than 
$\frac{1}{2}$, we have $y_1 = y_2 > 0$ and $t_1 > 0$, $t_2 > 0$. We want to find $t$ ($= t_1 = 1 - t_2$) so 
that 

$$0 = y_1^2 - y_2^2 = t^2 b'\Sigma_1 b - (1 - t)^2 b'\Sigma_2 b = b'\left[t^2 \Sigma_1 - (1 - t)^2 \Sigma_2\right] b. \tag{20}$$

Since $y_1^2$ increases with $t$ and $y_2^2$ decreases with increasing $t$, there is one and 
only one solution to (20), and this can be approximated by trial and error by 
guessing a value of $t$ ($0 < t < 1$), solving (14) for $b$, and computing the 
quadratic form on the right of (20). Then another $t$ can be tried. 
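
Because the right-hand side of (20) changes sign exactly once on $0 < t < 1$, the guessing scheme can again be replaced by bisection. The sketch below assumes $p = 2$ and hypothetical parameters; `minimax_t` and the other names are illustrative, not the authors' program.

```python
import math

def solve2(M, g):
    """Solve the 2x2 system M b = g by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * g[0] - M[0][1] * g[1]) / det,
            (M[0][0] * g[1] - M[1][0] * g[0]) / det]

def quad(b, S):
    """Quadratic form b' S b."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def b_of_t(t, S1, S2, gamma):
    """Solve (14) with t1 = t and t2 = 1 - t."""
    M = [[t * S1[i][j] + (1 - t) * S2[i][j] for j in range(2)] for i in range(2)]
    return solve2(M, gamma)

def minimax_t(S1, S2, gamma, tol=1e-12):
    """Bisection on the quadratic form on the right of (20)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        b = b_of_t(mid, S1, S2, gamma)
        f = mid ** 2 * quad(b, S1) - (1 - mid) ** 2 * quad(b, S2)
        if f < 0:
            lo = mid      # y1 < y2: increase t
        else:
            hi = mid      # y1 >= y2: decrease t
    return 0.5 * (lo + hi)

# Hypothetical illustrative parameters (not from the text).
GAMMA = [1.0, 1.0]
S1 = [[1.0, 0.3], [0.3, 1.0]]
S2 = [[2.0, -0.5], [-0.5, 1.5]]

t_star = minimax_t(S1, S2, GAMMA)
b_star = b_of_t(t_star, S1, S2, GAMMA)
y1_star = t_star * math.sqrt(quad(b_star, S1))
y2_star = (1 - t_star) * math.sqrt(quad(b_star, S2))
print(t_star, y1_star, y2_star)
```

At the returned $t$ the two arguments agree to within the tolerance, so both misclassification probabilities equal $1 - \Phi(y_1)$.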


An alternative approach is to set $y_1 = y_2$ in (9) and solve for $c$. Then the 
common value of $y_1 = y_2$ is 

$$\frac{b'\gamma}{(b'\Sigma_1 b)^{1/2} + (b'\Sigma_2 b)^{1/2}}, \tag{21}$$

and we want to find $b$ to maximize this, where $b$ is of the form 

$$[t\Sigma_1 + (1 - t)\Sigma_2]^{-1}\gamma \tag{22}$$

with $0 < t < 1$. 

When $\Sigma_1 = \Sigma_2 = \Sigma$, twice the maximum of (21) is $(\gamma'\Sigma^{-1}\gamma)^{1/2}$, the Mahalanobis 
distance between the populations (the square root of the squared distance 
$\Delta^2$). This suggests that when $\Sigma_1$ may be unequal to $\Sigma_2$, twice the maximum 
of (21) might be called the distance between the populations. 


Welch and Wimpress (1961) have programmed the minimax procedure 
and applied it to the recognition of spoken sounds. 


Case of A Priori Probabilities 


Suppose we are given a priori probabilities, $q_1$ and $q_2$, of the first and second 
populations, respectively. Then the probability of a misclassification is 

$$q_1[1 - \Phi(y_1)] + q_2[1 - \Phi(y_2)] = 1 - [q_1\Phi(y_1) + q_2\Phi(y_2)], \tag{23}$$

which we want to minimize. The solution will be an admissible linear 
procedure. If we know it involves $y_1 \ge 0$ and $y_2 \ge 0$, we can substitute 
$y_1 = t(b'\Sigma_1 b)^{1/2}$ and $y_2 = (1 - t)(b'\Sigma_2 b)^{1/2}$, where $b = [t\Sigma_1 + (1 - t)\Sigma_2]^{-1}\gamma$, 
into (23) and set the derivative of (23) with respect to $t$ equal to 0, obtaining 

$$q_1 \phi(y_1) \frac{dy_1}{dt} + q_2 \phi(y_2) \frac{dy_2}{dt} = 0, \tag{24}$$

where $\phi(u) = (2\pi)^{-1/2} e^{-u^2/2}$. There does not seem to be any easy or direct 
way of solving (24) for $t$. The left-hand side of (24) is not necessarily 
monotonic. In fact, there may be several roots to (24). If there are, the 
absolute minimum will be found by putting the solutions into (23). (We 
remind the reader that the curve of admissible error probabilities is not 
necessarily convex.) 
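
Since (24) may have several roots, a direct scan of (23) over a grid of $t$ values is one cautious way to locate the absolute minimum. The sketch below assumes $p = 2$, hypothetical parameters, and a hypothetical grid size; it is a brute-force substitute for solving (24), not a method given in the text.

```python
import math

def solve2(M, g):
    """Solve the 2x2 system M b = g by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * g[0] - M[0][1] * g[1]) / det,
            (M[0][0] * g[1] - M[1][0] * g[0]) / det]

def quad(b, S):
    """Quadratic form b' S b."""
    return sum(b[i] * S[i][j] * b[j] for i in range(2) for j in range(2))

def Phi(y):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def error_prob(t, q1, q2, S1, S2, gamma):
    """Probability of misclassification (23) for the procedure indexed by t."""
    M = [[t * S1[i][j] + (1 - t) * S2[i][j] for j in range(2)] for i in range(2)]
    b = solve2(M, gamma)
    y1 = t * math.sqrt(quad(b, S1))
    y2 = (1 - t) * math.sqrt(quad(b, S2))
    return 1.0 - (q1 * Phi(y1) + q2 * Phi(y2))

def best_t(q1, q2, S1, S2, gamma, grid=1000):
    """Scan t = 0, 1/grid, ..., 1 and return the t with the smallest (23)."""
    ts = [k / grid for k in range(grid + 1)]
    return min(ts, key=lambda t: error_prob(t, q1, q2, S1, S2, gamma))

# Hypothetical illustrative parameters (not from the text).
GAMMA = [1.0, 1.0]
S1 = [[1.0, 0.3], [0.3, 1.0]]
S2 = [[2.0, -0.5], [-0.5, 1.5]]
Q1, Q2 = 0.7, 0.3

t_best = best_t(Q1, Q2, S1, S2, GAMMA)
print(t_best, error_prob(t_best, Q1, Q2, S1, S2, GAMMA))
```

A grid scan only covers procedures with $y_1 \ge 0$ and $y_2 \ge 0$, in line with the substitution made above; a finer grid or a local root search on (24) could refine the answer.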

Anderson and Bahadur (1962) studied these linear procedures in general, 
including $y_1 < 0$ and $y_2 < 0$. Clunies-Ross and Riffenburgh (1960) approached 
the problem from a more geometric point of view. 


PROBLEMS 


6.1. (Sec. 6.3) Let $\pi_i$ be $N(\mu, \Sigma_i)$, $i = 1, 2$. Find the form of the admissible 
classification procedures. 


6.2. (Sec. 6.3) Prove that every complete class of procedures includes the class of 
admissible procedures. 


6.3. (Sec. 6.3) Prove that if the class of admissible procedures is complete, it is 
minimal complete. 


6.4. (Sec. 6.3) The Neyman-Pearson fundamental lemma states that of all tests at a 
given significance level of the null hypothesis that $x$ is drawn from $p_1(x)$ 
against alternative that it is drawn from $p_2(x)$ the most powerful test has the 
critical region $p_1(x)/p_2(x) < k$. Show that the discussion in Section 6.3 proves 
this result. 


6.5. (Sec. 6.3) When $p(x) = n(x \mid \mu, \Sigma)$ find the best test of $\mu = 0$ against $\mu = \mu^*$ 
at significance level $\varepsilon$. Show that this test is uniformly most powerful against all 
alternatives $\mu = c\mu^*$, $c > 0$. Prove that there is no uniformly most powerful test 
against $\mu = \mu^{(1)}$ and $\mu = \mu^{(2)}$ unless $\mu^{(1)} = c\mu^{(2)}$ for some $c > 0$. 


6.6. (Sec. 6.4) Let $P(2 \mid 1)$ and $P(1 \mid 2)$ be defined by (14) and (15). Prove if 
$-\frac{1}{2}\Delta^2 < c < \frac{1}{2}\Delta^2$, then $P(2 \mid 1)$ and $P(1 \mid 2)$ are decreasing functions of $\Delta$. 


6.7. (Sec. 6.4) Let $x' = (x^{(1)\prime}, x^{(2)\prime})$. Using Problem 5.23 and Problem 6.6, prove 
that the class of classification procedures based on $x$ is uniformly as good as 
the class of procedures based on $x^{(1)}$. 


6.8. (Sec. 6.5.1) Find the criterion for classifying irises as Iris setosa or Iris 
versicolor on the basis of data given in Section 5.3.4. Classify a random sample 
of 5 Iris virginica in Table 3.4. 


6.9. (Sec. 6.5.1) Let $W(x)$ be the classification criterion given by (2). Show that the 
$T^2$-criterion for testing $N(\mu^{(1)}, \Sigma) = N(\mu^{(2)}, \Sigma)$ is proportional to $W(\bar{x}^{(1)})$ and 
$W(\bar{x}^{(2)})$. 


6.10. (Sec. 6.5.1) Show that the probabilities of misclassification of $x_1, \dots, x_N$ (all 
assumed to be from either $\pi_1$ or $\pi_2$) decrease as $N$ increases. 


6.11. (Sec. 6.5) Show that the elements of $M$ are invariant under the transforma- 
tion (34) and that any function of the sufficient statistics that is invariant is a 
function of $M$. 


6.12. (Sec. 6.5) Consider d'x(?. Prove that the ratio 


(d'xt -dzy 
N; М: 
Y, (40-49%) + У (40-49) 
а=1 


а=1 


4 


6.13. (Sec. 6.6) Show that the derivative of (2) to terms of order n^! is 


[1 1Гр-т о р-2 р, 
-acaz + z [5s +22 - al}. 


6.14. (Sec. 6.6) Show $\mathcal{E}D^2$ is (4). [Hint: Let $\Sigma = I$ and show that $\mathcal{E}(S^{-1} \mid \Sigma = I) = 
[n/(n - p - 1)]I$.] 


6.15. (Sec. 6.6.2) Show 


—ip? Z-iWM 
{7 suni) — Prf A <и 


= sof NE [из + (р – 3)и – Аи + p^| 


1 


+ 
2А 


[u + 2ди? + (р- 3 + А?) - А? pa] 


+ [ne + 4Аи? + (2р-з+А?)и+2(р- nal} + O(n^7). 


6.16. (Sec. 6.8) Let $\pi_i$ be $N(\mu^{(i)}, \Sigma)$, $i = 1, \dots, m$. If the $\mu^{(i)}$ are on a line (i.e., 
$\mu^{(i)} = \mu + \nu_i\beta$), show that for admissible procedures the $R_i$ are defined by 
parallel planes. Thus show that only one discriminant function $u_{jk}(x)$ need be 
used. 


6.17. (Sec. 6.8) In Section 8.8 data are given on samples from four populations of 
skulls. Consider the first two measurements and the first three samples. 
Construct the classification functions $u_{ij}(x)$. Find the procedure for $q_i = 
N_i/(N_1 + N_2 + N_3)$. Find the minimax procedure. 


6.18. (Sec. 6.10) Show that $b'x = c$ is the equation of a plane that is tangent to an 
ellipsoid of constant density of $\pi_1$ and to an ellipsoid of constant density of $\pi_2$ 
at a common point. 


6.19. (Sec. 6.8) Let $x_1^{(i)}, \dots, x_{N_i}^{(i)}$ be observations from $N(\mu^{(i)}, \Sigma)$, $i = 1, 2, 3$, and let 
$x$ be an observation to be classified. Give explicitly the maximum likelihood 
rule. 


6.20. (Sec. 6.5) Verify (33). 


CHAPTER 7 


The Distribution of the Sample 
Covariance Matrix and the 
Sample Generalized Variance 


7.1. INTRODUCTION 


The sample covariance matrix, $S = [1/(N - 1)]\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, is an 
unbiased estimator of the population covariance matrix $\Sigma$. In Section 4.2 we 
found the density of $A = (N - 1)S$ in the case of a $2 \times 2$ matrix. In Section 
7.2 this result will be generalized to the case of a matrix $A$ of any order. 
When $\Sigma = I$, this distribution is in a sense a generalization of the $\chi^2$-distribution. 
The distribution of $A$ (or $S$), often called the Wishart distribution, is 
fundamental to multivariate statistical analysis. In Sections 7.3 and 7.4 we 
discuss some properties of the Wishart distribution. 

The generalized variance of the sample is defined as $|S|$ in Section 7.5; it 
is a measure of the scatter of the sample. Its distribution is characterized. 
The density of the set of all correlation coefficients when the components of 
the observed vector are independent is obtained in Section 7.6. 

The inverted Wishart distribution is introduced in Section 7.7 and is used 
as an a priori distribution of $\Sigma$ to obtain a Bayes estimator of the covariance 
matrix. In Section 7.8 we consider improving on $S$ as an estimator of $\Sigma$ with 
respect to two loss functions. Section 7.9 treats the distributions for sampling 
from elliptically contoured distributions. 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 


7.2. THE WISHART DISTRIBUTION 


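We shall obtain the distribution of $A = \sum_{\alpha=1}^{N} (X_\alpha - \bar{X})(X_\alpha - \bar{X})'$, where 
$X_1, \dots, X_N$ ($N > p$) are independent, each with the distribution $N(\mu, \Sigma)$. As 
was shown in Section 3.3, $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $n = N - 1$ 
and $Z_1, \dots, Z_n$ are independent, each with the distribution $N(0, \Sigma)$. We shall 
show that the density of $A$ for $A$ positive definite is 

$$\frac{|A|^{\frac{1}{2}(n-p-1)} \exp\left(-\frac{1}{2} \operatorname{tr} \Sigma^{-1} A\right)}{2^{\frac{1}{2}pn} \pi^{p(p-1)/4} |\Sigma|^{\frac{1}{2}n} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(n + 1 - i)\right]}. \tag{1}$$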

We shall first consider the case of $\Sigma = I$. Let 

$$(Z_1, \dots, Z_n) = \begin{pmatrix} v_1' \\ \vdots \\ v_p' \end{pmatrix}. \tag{2}$$

Then the elements of $A = (a_{ij})$ are inner products of these $n$-component 
vectors, $a_{ij} = v_i'v_j$. The vectors $v_1, \dots, v_p$ are independently distributed, each 
according to $N(0, I_n)$. It will be convenient to transform to new coordinates 
according to the Gram-Schmidt orthogonalization. Let $w_1 = v_1$, 

$$w_i = v_i - \sum_{j=1}^{i-1} \frac{v_i'w_j}{w_j'w_j} w_j, \qquad i = 2, \dots, p. \tag{3}$$

We prove by induction that $w_k$ is orthogonal to $w_i$, $k < i$. Assume $w_k'w_h = 0$, 
$k \ne h$, $k, h = 1, \dots, i - 1$; then take the inner product of $w_k$ and (3) to obtain 
$w_k'w_i = 0$, $k = 1, \dots, i - 1$. (Note that $\Pr\{\|w_i\| = 0\} = 0$.) 

Define $t_{ii} = \|w_i\| = \sqrt{w_i'w_i}$, $i = 1, \dots, p$, and $t_{ij} = v_i'w_j/\|w_j\|$, $j = 1, \dots, i - 1$, 
$i = 2, \dots, p$. Since $v_i = \sum_{j=1}^{i} (t_{ij}/\|w_j\|) w_j$, 

$$a_{hi} = v_h'v_i = \sum_{j=1}^{\min(h,i)} t_{hj} t_{ij}. \tag{4}$$

If we define the lower triangular matrix $T = (t_{ij})$ with $t_{ii} > 0$, $i = 1, \dots, p$, and 
$t_{ij} = 0$, $i < j$, then 

$$A = TT'. \tag{5}$$
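
The construction in (3) to (5) is easy to check on a small numerical case. The sketch below is only illustrative: the function name `gram_schmidt_T` and the vectors in `V` are hypothetical, and the code simply follows the definitions of $w_i$ and $t_{ij}$ given above before confirming that $TT'$ reproduces $A$.

```python
import math

def gram_schmidt_T(V):
    """Rows of V are v_1, ..., v_p (each of length n); return T of (3)-(5)."""
    p = len(V)
    W = []                                   # orthogonal vectors w_1, ..., w_p
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        w = list(V[i])
        for j in range(i):
            norm_j = math.sqrt(sum(x * x for x in W[j]))
            # t_ij = v_i' w_j / ||w_j||
            T[i][j] = sum(a * b for a, b in zip(V[i], W[j])) / norm_j
            # subtract the component of v_i along w_j, as in (3)
            w = [wk - (T[i][j] / norm_j) * wjk for wk, wjk in zip(w, W[j])]
        T[i][i] = math.sqrt(sum(x * x for x in w))   # t_ii = ||w_i||
        W.append(w)
    return T

# Hypothetical vectors with p = 3 and n = 4 (not from the text).
V = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 0.0, 1.0, 1.0]]

T = gram_schmidt_T(V)
A = [[sum(a * b for a, b in zip(V[i], V[j])) for j in range(3)] for i in range(3)]
TT = [[sum(T[i][k] * T[j][k] for k in range(3)) for j in range(3)] for i in range(3)]
print(T)
```

Because $T$ is lower triangular with positive diagonal and $A = TT'$, it coincides with the Cholesky factor of $A$.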


Note that $t_{ij}$, $j = 1, \dots, i - 1$, are the first $i - 1$ coordinates of $v_i$ in the 
coordinate system with $w_1, \dots, w_{i-1}$ as the first $i - 1$ coordinate axes. (See 
Figure 7.1.) The sum of the other $n - i + 1$ coordinates squared is 
$\|v_i\|^2 - \sum_{j=1}^{i-1} t_{ij}^2 = t_{ii}^2 = \|w_i\|^2$; $w_i$ is the vector from $v_i$ to its projection on 
$w_1, \dots, w_{i-1}$ (or equivalently on $v_1, \dots, v_{i-1}$). 

Figure 7.1. Transformation of coordinates. 


Lemma 7.2.1. Conditional on $w_1, \dots, w_{i-1}$ (or equivalently on $v_1, \dots, v_{i-1}$), 
$t_{i1}, \dots, t_{i,i-1}$ and $t_{ii}^2$ are independently distributed; $t_{ij}$ is distributed according to 
$N(0, 1)$, $i > j$; and $t_{ii}^2$ has the $\chi^2$-distribution with $n - i + 1$ degrees of freedom. 


Proof. The coordinates of $v_i$ referred to the new orthogonal coordinates 
with $v_1, \dots, v_{i-1}$ defining the first coordinate axes are independently normally 
distributed with means 0 and variances 1 (Theorem 3.3.1). $t_{ii}^2$ is the sum 
of the coordinates squared omitting the first $i - 1$. ∎ 


Since the conditional distribution of $t_{i1}, \dots, t_{ii}$ does not depend on 
$v_1, \dots, v_{i-1}$, they are distributed independently of $t_{11}, t_{21}, t_{22}, \dots, t_{i-1,i-1}$. 


Corollary 7.2.1. Let $Z_1, \dots, Z_n$ ($n \ge p$) be independently distributed, each 
according to $N(0, I)$; let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' = TT'$, where $t_{ij} = 0$, $i < j$, and $t_{ii} > 0$, 
$i = 1, \dots, p$. Then $t_{11}, t_{21}, \dots, t_{pp}$ are independently distributed; $t_{ij}$ is distributed 
according to $N(0, 1)$, $i > j$; and $t_{ii}^2$ has the $\chi^2$-distribution with $n - i + 1$ degrees 
of freedom. 
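
Corollary 7.2.1 can be examined by simulation. The following is a rough Monte Carlo sketch, assuming hypothetical values $p = 3$, $n = 6$ and a fixed seed; it computes the triangular factor of each simulated $A$ by the Cholesky recursion and averages $t_{ii}^2$, whose expectation under the corollary is $n - i + 1$, and $t_{21}^2$, whose expectation is 1.

```python
import math
import random

def cholesky(A):
    """Lower-triangular T with A = T T' (A symmetric positive definite)."""
    p = len(A)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = A[i][j] - sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = math.sqrt(s) if i == j else s / T[j][j]
    return T

def simulate_bartlett(p, n, reps, seed=0):
    """Average t_ii^2 (i = 1..p) and t_21^2 over simulated A with Sigma = I."""
    rng = random.Random(seed)
    mean_diag_sq = [0.0] * p
    mean_t21_sq = 0.0
    for _ in range(reps):
        # Row i of V is the n-component vector v_i; entries are N(0, 1).
        V = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
        A = [[sum(V[i][k] * V[j][k] for k in range(n)) for j in range(p)]
             for i in range(p)]
        T = cholesky(A)
        for i in range(p):
            mean_diag_sq[i] += T[i][i] ** 2 / reps
        mean_t21_sq += T[1][0] ** 2 / reps
    return mean_diag_sq, mean_t21_sq

# Hypothetical settings (not from the text).
diag_means, t21_mean = simulate_bartlett(p=3, n=6, reps=4000, seed=1)
print(diag_means, t21_mean)   # expected values are roughly [6, 5, 4] and 1
```

Agreement of the averages with $n - i + 1$ is only a consistency check on the first moments, not a test of the full distributional statement.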
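Since $t_{ii}$ has density $2^{-\frac{1}{2}(n-i-1)} t^{n-i} e^{-\frac{1}{2}t^2} / \Gamma[\frac{1}{2}(n + 1 - i)]$, the joint density 
of $t_{ij}$, $j = 1, \dots, i$, $i = 1, \dots, p$, is 

$$\frac{\prod_{i=1}^{p} t_{ii}^{n-i} \exp\left(-\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{i} t_{ij}^2\right)}{2^{\frac{1}{2}p(n-2)} \pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(n + 1 - i)\right]}. \tag{6}$$

Let $C$ be a lower triangular matrix ($c_{ij} = 0$, $i < j$) such that $\Sigma = CC'$ and 
$c_{ii} > 0$. The linear transformation $T^* = CT$, that is, 

$$t_{ij}^* = \sum_{k=j}^{i} c_{ik} t_{kj}, \quad i \ge j; \qquad t_{ij}^* = 0, \quad i < j, \tag{7}$$

can be written 

$$\begin{pmatrix} t_{11}^* \\ t_{21}^* \\ t_{22}^* \\ \vdots \\ t_{p1}^* \\ \vdots \\ t_{pp}^* \end{pmatrix} = \begin{pmatrix} c_{11} & 0 & 0 & \cdots & 0 & \cdots & 0 \\ x & c_{22} & 0 & \cdots & 0 & \cdots & 0 \\ x & x & c_{22} & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ x & x & x & \cdots & c_{pp} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ x & x & x & \cdots & x & \cdots & c_{pp} \end{pmatrix} \begin{pmatrix} t_{11} \\ t_{21} \\ t_{22} \\ \vdots \\ t_{p1} \\ \vdots \\ t_{pp} \end{pmatrix}, \tag{8}$$

where $x$ denotes an element, possibly nonzero. Since the matrix of the 
transformation is triangular, its determinant is the product of the diagonal 
elements, namely, $\prod_{i=1}^{p} c_{ii}^{i}$. The Jacobian of the transformation from $T$ to $T^*$ 
is the reciprocal of the determinant. The density of $T^*$ is obtained by 
substituting into (6) $t_{ii} = t_{ii}^*/c_{ii}$ and 

$$\begin{aligned} \sum_{i=1}^{p} \sum_{j=1}^{i} t_{ij}^2 = \operatorname{tr} TT' &= \operatorname{tr} C^{-1} T^* T^{*\prime} (C')^{-1} \\ &= \operatorname{tr} T^* T^{*\prime} (C')^{-1} C^{-1} \\ &= \operatorname{tr} T^* T^{*\prime} \Sigma^{-1} = \operatorname{tr} T^{*\prime} \Sigma^{-1} T^*, \end{aligned} \tag{9}$$

and using $\prod_{i=1}^{p} c_{ii}^2 = |C| \cdot |C'| = |\Sigma|$. 

Theorem 7.2.1. Let $Z_1, \dots, Z_n$ ($n \ge p$) be independently distributed, each 
according to $N(0, \Sigma)$; let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$; and let $A = T^* T^{*\prime}$, where $t_{ij}^* = 0$, 
$i < j$, and $t_{ii}^* > 0$. Then the density of $T^*$ is 

$$\frac{\prod_{i=1}^{p} t_{ii}^{*\,n-i} \exp\left(-\frac{1}{2} \operatorname{tr} \Sigma^{-1} T^* T^{*\prime}\right)}{2^{\frac{1}{2}p(n-2)} \pi^{p(p-1)/4} |\Sigma|^{\frac{1}{2}n} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(n + 1 - i)\right]}. \tag{10}$$

We can write (4) as $a_{hi} = \sum_{j=1}^{i} t_{hj}^* t_{ij}^*$ for $h \ge i$. Then 

$$\frac{\partial a_{hi}}{\partial t_{kl}^*} = 0, \quad k > h; \qquad \frac{\partial a_{hi}}{\partial t_{kl}^*} = 0, \quad k = h, \ l > i; \tag{11}$$

that is, $\partial a_{hi}/\partial t_{kl}^* = 0$ if $k, l$ is beyond $h, i$ in the lexicographic ordering. The 
Jacobian of the transformation from $A$ to $T^*$ is the determinant of the lower 
triangular matrix with diagonal elements 

$$\frac{\partial a_{hh}}{\partial t_{hh}^*} = 2t_{hh}^*, \tag{12}$$

$$\frac{\partial a_{hi}}{\partial t_{hi}^*} = t_{ii}^*, \qquad h > i. \tag{13}$$

The Jacobian is therefore $2^p \prod_{i=1}^{p} t_{ii}^{*\,p+1-i}$. The Jacobian of the transformation 
from $T^*$ to $A$ is the reciprocal. 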


Theorem 7.2.2. Let $Z_1, \dots, Z_n$ be independently distributed, each according 
to $N(0, \Sigma)$. The density of $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ is 

$$\frac{|A|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2} \operatorname{tr} \Sigma^{-1} A}}{2^{\frac{1}{2}pn} \pi^{p(p-1)/4} |\Sigma|^{\frac{1}{2}n} \prod_{i=1}^{p} \Gamma\left[\frac{1}{2}(n + 1 - i)\right]} \tag{14}$$

for $A$ positive definite, and 0 otherwise. 


Corollary 7.2.2. Let $X_1, \dots, X_N$ ($N > p$) be independently distributed, each 
according to $N(\mu, \Sigma)$. Then the density of $A = \sum_{\alpha=1}^{N} (X_\alpha - \bar{X})(X_\alpha - \bar{X})'$ is (14) 
for $n = N - 1$. 


The density (14) will be denoted by $w(A \mid \Sigma, n)$, and the associated distribution 
will be termed $W(\Sigma, n)$. If $n < p$, then $A$ does not have a density, but 
its distribution is nevertheless defined, and we shall refer to it as $W(\Sigma, n)$. 


Corollary 7.2.3. Let $X_1, \dots, X_N$ ($N > p$) be independently distributed, each 
according to $N(\mu, \Sigma)$. The distribution of $S = (1/n)\sum_{\alpha=1}^{N} (X_\alpha - \bar{X})(X_\alpha - \bar{X})'$ is 
$W[(1/n)\Sigma, n]$, where $n = N - 1$. 


Proof. $S$ has the distribution of $\sum_{\alpha=1}^{n} [(1/\sqrt{n})Z_\alpha][(1/\sqrt{n})Z_\alpha]'$, where 
$(1/\sqrt{n})Z_1, \dots, (1/\sqrt{n})Z_n$ are independently distributed, each according to 
$N[0, (1/n)\Sigma]$. Theorem 7.2.2 implies this corollary. ∎ 

The Wishart distribution for $p = 2$ as given in Section 4.2.1 was derived by 
Fisher (1915). The distribution for arbitrary $p$ was obtained by Wishart 
(1928) by a geometric argument using $v_1, \dots, v_p$ defined above. As noted in 
Section 3.2, the $i$th diagonal element of $A$ is the squared length of the $i$th 
vector, $a_{ii} = v_i'v_i = \|v_i\|^2$, and the $i,j$th off-diagonal element of $A$ is the product 
of the lengths of $v_i$ and $v_j$ and the cosine of the angle between them. The 
matrix $A$ specifies the lengths and configuration of the vectors. 

We shall give a geometric interpretation† of the derivation of the density 
of the rectangular coordinates $t_{ij}$, $i \ge j$, when $\Sigma = I$. The probability element 
of $t_{11}$ is approximately the probability that $\|v_1\|$ lies in the interval 
$t_{11} < \|v_1\| < t_{11} + dt_{11}$. This is the probability that $v_1$ falls in a spherical shell in $n$ 
dimensions with inner radius $t_{11}$ and thickness $dt_{11}$. In this region, the density 
$(2\pi)^{-\frac{1}{2}n} \exp(-\frac{1}{2} v_1'v_1)$ is approximately constant, namely, $(2\pi)^{-\frac{1}{2}n} \exp(-\frac{1}{2} t_{11}^2)$. 
The surface area of the unit sphere in $n$ dimensions is $C(n) = 2\pi^{\frac{1}{2}n}/\Gamma(\frac{1}{2}n)$ 
(Problems 7.1-7.3), and the volume of the spherical shell is approximately 
$C(n) t_{11}^{n-1} dt_{11}$. The probability element is the product of the volume and 
approximate density, namely, 

$$2^{-\frac{1}{2}(n-2)} t_{11}^{n-1} \exp\left(-\tfrac{1}{2} t_{11}^2\right) dt_{11} / \Gamma\left(\tfrac{1}{2}n\right). \tag{15}$$

The probability element of $t_{i1}, \dots, t_{i,i-1}, t_{ii}$ given $v_1, \dots, v_{i-1}$ (i.e., given 
$w_1, \dots, w_{i-1}$) is approximately the probability that $v_i$ falls in the region for 
which $t_{i1} < v_i'w_1/\|w_1\| < t_{i1} + dt_{i1}, \dots, t_{i,i-1} < v_i'w_{i-1}/\|w_{i-1}\| < t_{i,i-1} + dt_{i,i-1}$, 
and $t_{ii} < \|w_i\| < t_{ii} + dt_{ii}$, where $w_i$ is the projection of $v_i$ on the $(n - i + 1)$- 
dimensional space orthogonal to $w_1, \dots, w_{i-1}$. Each of the first $i - 1$ pairs of 
inequalities defines the region between two hyperplanes (the different pairs 
being orthogonal). The last pair of inequalities defines a cylindrical shell 
whose intersection with the $(i - 1)$-dimensional hyperplane spanned by 
$v_1, \dots, v_{i-1}$ is a spherical shell in $n - i + 1$ dimensions with inner radius $t_{ii}$. 
In this region the density $(2\pi)^{-\frac{1}{2}n} \exp(-\frac{1}{2} v_i'v_i)$ is approximately constant, 
namely, $(2\pi)^{-\frac{1}{2}n} \exp(-\frac{1}{2} \sum_{j=1}^{i} t_{ij}^2)$. The volume of the region is approximately 
$dt_{i1} \cdots dt_{i,i-1}\, C(n - i + 1)\, t_{ii}^{n-i}\, dt_{ii}$. The probability element is 

$$\frac{2^{-\frac{1}{2}(n-2)} \pi^{-\frac{1}{2}(i-1)} t_{ii}^{n-i} \exp\left(-\frac{1}{2} \sum_{j=1}^{i} t_{ij}^2\right)}{\Gamma\left[\frac{1}{2}(n + 1 - i)\right]} \, dt_{i1} \cdots dt_{ii}. \tag{16}$$

Then the product of (15) and (16) for $i = 2, \dots, p$ is (6) times $dt_{11} \cdots dt_{pp}$. 
This analysis, which exactly parallels the geometric derivation by Wishart 
[and later by Mahalanobis, Bose, and Roy (1937)], was given by Sverdrup 
(1947) [and by Fog (1948) for $p = 3$]. Another method was used by Madow 
(1938), who drew on the distribution of correlation coefficients (for $\Sigma = I$) 
obtained by Hotelling by considering certain partial correlation coefficients. 
Hsu (1939b) gave an inductive proof, and Rasch (1948) gave a method 
involving the use of a functional equation. A different method is to obtain the 
characteristic function and invert it, as was done by Ingham (1933) and by 
Wishart and Bartlett (1933). 

†In the first edition of this book, the derivation of the Wishart distribution and its geometric 
interpretation were in terms of the nonorthogonal vectors $v_1, \dots, v_p$. 

Cramér (1946) verified that the Wishart distribution has the characteristic 
function of $A$. By means of alternative matrix transformations Elfving (1947), 
Mauldon (1955), and Olkin and Roy (1954) derived the Wishart distribution 
via the Bartlett decomposition; Kshirsagar (1959) based his derivation on 
random orthogonal transformations. Narain (1948), (1950) and Ogawa (1953) 
used a regression approach. James (1954), Khatri and Ramachandran (1958), 
and Khatri (1963) applied different methods. Giri (1977) used invariance. 
Wishart (1948) surveyed the derivations up to that date. Some of these 
methods are indicated in the problems. 

The relation $A = TT'$ is known as the Bartlett decomposition [Bartlett 
(1939)], and the (nonzero) elements of $T$ were termed rectangular coordinates 
by Mahalanobis, Bose, and Roy (1937). 


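Corollary 7.2.4 

$$\int \cdots \int_{B > 0} |B|^{t - \frac{1}{2}(p+1)} e^{-\operatorname{tr} B} \, dB = \pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma\left[t - \tfrac{1}{2}(i - 1)\right]. \tag{17}$$

Proof. Here $B > 0$ denotes $B$ positive definite. Since (14) is a density, its 
integral for $A > 0$ is 1. Let $\Sigma = I$, $A = 2B$ ($dA = 2^{\frac{1}{2}p(p+1)} dB$), and $n = 2t$. Then the 
fact that the integral is 1 is identical to (17) for $t$ a half integer. However, if 
we derive (14) from (6), we can let $n$ be any real number greater than $p - 1$. 
In fact (17) holds for complex $t$ such that $\mathcal{R}t > p - 1$. ($\mathcal{R}t$ means the real 
part of $t$.) ∎ 


Definition 7.2.1. The multivariate gamma function is 

$$\Gamma_p(t) = \pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma\left[t - \tfrac{1}{2}(i - 1)\right]. \tag{18}$$


The Wishart density can be written 

$$w(A \mid \Sigma, n) = \frac{|A|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2} \operatorname{tr} \Sigma^{-1} A}}{2^{\frac{1}{2}pn} |\Sigma|^{\frac{1}{2}n} \Gamma_p\left(\frac{1}{2}n\right)}. \tag{19}$$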


7.3. SOME PROPERTIES OF THE WISHART DISTRIBUTION 


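7.3.1. The Characteristic Function 


The characteristic function of the Wishart distribution can be obtained 
directly from the distribution of the observations. Suppose $Z_1, \dots, Z_n$ are 
distributed independently, each with density 

$$\frac{1}{(2\pi)^{\frac{1}{2}p} |\Sigma|^{\frac{1}{2}}} \exp\left(-\tfrac{1}{2} z'\Sigma^{-1}z\right). \tag{1}$$

Let 

$$A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'. \tag{2}$$

Introduce the $p \times p$ matrix $\Theta = (\theta_{ij})$ with $\theta_{ij} = \theta_{ji}$. The characteristic function 
of $A_{11}, A_{22}, \dots, A_{pp}, 2A_{12}, 2A_{13}, \dots, 2A_{p-1,p}$ is 

$$\mathcal{E} \exp[i \operatorname{tr}(A\Theta)] = \mathcal{E} \exp\left[i \operatorname{tr} \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' \Theta\right] = \mathcal{E} \exp\left[i \operatorname{tr} \sum_{\alpha=1}^{n} Z_\alpha' \Theta Z_\alpha\right] = \mathcal{E} \exp\left[i \sum_{\alpha=1}^{n} Z_\alpha' \Theta Z_\alpha\right]. \tag{3}$$

It follows from Lemma 2.6.1 that 

$$\mathcal{E} \exp\left[i \sum_{\alpha=1}^{n} Z_\alpha' \Theta Z_\alpha\right] = \prod_{\alpha=1}^{n} \mathcal{E} \exp(i Z_\alpha' \Theta Z_\alpha) = \left[\mathcal{E} \exp(i Z'\Theta Z)\right]^n, \tag{4}$$

where $Z$ has the density (1). For $\Theta$ real, there is a real nonsingular matrix $B$ 
such that 

$$B'\Sigma^{-1}B = I, \tag{5}$$

$$B'\Theta B = D, \tag{6}$$

where $D$ is a real diagonal matrix (Theorem A.2.2 of the Appendix). If we let 
$z = By$, then 

$$\mathcal{E} \exp(i Z'\Theta Z) = \mathcal{E} \exp(i Y'DY) = \mathcal{E} \prod_{j=1}^{p} \exp(i d_{jj} Y_j^2) = \prod_{j=1}^{p} \mathcal{E} \exp(i d_{jj} Y_j^2) \tag{7}$$

by Lemma 2.6.2. The $j$th factor in the second product is $\mathcal{E} \exp(i d_{jj} Y_j^2)$, 
where $Y_j$ has the distribution $N(0, 1)$; this is the characteristic function of the 
$\chi^2$-distribution with one degree of freedom, namely $(1 - 2i d_{jj})^{-\frac{1}{2}}$ [as can be 
proved by expanding $\exp(i d_{jj} y_j^2)$ in a power series and integrating term by 
term]. Thus 

$$\mathcal{E} \exp(i Z'\Theta Z) = \prod_{j=1}^{p} (1 - 2i d_{jj})^{-\frac{1}{2}} = |I - 2iD|^{-\frac{1}{2}} \tag{8}$$

since $I - 2iD$ is a diagonal matrix. From (5) and (6) we see that 

$$\begin{aligned} |I - 2iD| &= |B'\Sigma^{-1}B - 2iB'\Theta B| \\ &= |B'(\Sigma^{-1} - 2i\Theta)B| \\ &= |B'| \cdot |\Sigma^{-1} - 2i\Theta| \cdot |B| \\ &= |B|^2 \cdot |\Sigma^{-1} - 2i\Theta|, \end{aligned} \tag{9}$$

$|B'| \cdot |\Sigma^{-1}| \cdot |B| = |I| = 1$, and $|B|^2 = 1/|\Sigma^{-1}|$. Combining the above results, 
we obtain 

$$\mathcal{E} \exp[i \operatorname{tr}(A\Theta)] = \frac{|\Sigma^{-1}|^{\frac{1}{2}n}}{|\Sigma^{-1} - 2i\Theta|^{\frac{1}{2}n}} = |I - 2i\Theta\Sigma|^{-\frac{1}{2}n}. \tag{10}$$

It can be shown that the result is valid provided $(\mathcal{R}(\sigma^{jk} - 2i\theta_{jk}))$ is positive 
definite. In particular, it is true for all real $\Theta$. It also holds for $\Sigma$ singular. 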
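
Formula (10) can be compared with a simulated average of $\exp[i \operatorname{tr}(A\Theta)]$. The sketch below assumes $p = 2$, hypothetical matrices `THETA` and `SIGMA`, and $n = 4$; with $n$ even the exponent $-\frac{1}{2}n$ is an integer, which avoids any question of which branch of the complex power is meant.

```python
import cmath
import math
import random

def cf_formula(Theta, Sigma, n):
    """|I - 2i Theta Sigma|^(-n/2) for 2x2 matrices, equation (10)."""
    P = [[sum(Theta[r][k] * Sigma[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]
    M = [[(1.0 if r == c else 0.0) - 2j * P[r][c] for c in range(2)]
         for r in range(2)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return det ** (-n / 2)

def cf_monte_carlo(Theta, Sigma, n, reps, seed=0):
    """Average of exp[i tr(A Theta)] over simulated A = sum of Z Z'."""
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 matrix Sigma, used to draw Z ~ N(0, Sigma).
    l11 = math.sqrt(Sigma[0][0])
    l21 = Sigma[0][1] / l11
    l22 = math.sqrt(Sigma[1][1] - l21 * l21)
    total = 0j
    for _ in range(reps):
        s = 0.0                               # accumulates tr(A Theta)
        for _ in range(n):
            e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            z1, z2 = l11 * e1, l21 * e1 + l22 * e2
            s += (Theta[0][0] * z1 * z1 + 2.0 * Theta[0][1] * z1 * z2
                  + Theta[1][1] * z2 * z2)
        total += cmath.exp(1j * s)
    return total / reps

# Hypothetical illustrative values (not from the text).
THETA = [[0.10, 0.05], [0.05, -0.08]]
SIGMA = [[1.0, 0.5], [0.5, 2.0]]

exact = cf_formula(THETA, SIGMA, n=4)
approx = cf_monte_carlo(THETA, SIGMA, n=4, reps=20000, seed=2)
print(exact, approx)
```

The two complex numbers should agree up to simulation error of order $1/\sqrt{\text{reps}}$.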


Theorem 7.3.1. If $Z_1, \dots, Z_n$ are independent, each with distribution 
$N(0, \Sigma)$, then the characteristic function of $A_{11}, \dots, A_{pp}, 2A_{12}, \dots, 2A_{p-1,p}$, 
where $(A_{ij}) = A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, is given by (10). 


7.3.2. The Sum of Wishart Matrices 


Suppose the $A_i$, $i = 1, 2$, are distributed independently according to $W(\Sigma, n_i)$, 
respectively. Then $A_1$ is distributed as $\sum_{\alpha=1}^{n_1} Z_\alpha Z_\alpha'$, and $A_2$ is distributed as 
$\sum_{\alpha=n_1+1}^{n_1+n_2} Z_\alpha Z_\alpha'$, where $Z_1, \dots, Z_{n_1+n_2}$ are independent, each with distribution 
$N(0, \Sigma)$. Then $A = A_1 + A_2$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $n = n_1 + n_2$. 
Thus $A$ is distributed according to $W(\Sigma, n)$. Similarly, the sum of $q$ matrices 
distributed independently, each according to a Wishart distribution with 
covariance $\Sigma$, has a Wishart distribution with covariance matrix $\Sigma$ and 
number of degrees of freedom equal to the sum of the numbers of degrees of 
freedom of the component matrices. 

Theorem 7.3.2. If $A_1, \dots, A_q$ are independently distributed with $A_i$ distributed 
according to $W(\Sigma, n_i)$, then 

$$A = \sum_{i=1}^{q} A_i \tag{11}$$

is distributed according to $W(\Sigma, \sum_{i=1}^{q} n_i)$. 

7.3.3. A Certain Linear Transformation 


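We shall frequently make the transformation 

$$A = CBC', \tag{12}$$

where $C$ is a nonsingular $p \times p$ matrix. If $A$ is distributed according to 
$W(\Sigma, n)$, then $B$ is distributed according to $W(\Phi, n)$ where 

$$\Phi = C^{-1}\Sigma (C')^{-1}. \tag{13}$$

This is proved by the following argument: Let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where 
$Z_1, \dots, Z_n$ are independently distributed, each according to $N(0, \Sigma)$. Then 
$Y_\alpha = C^{-1}Z_\alpha$ is distributed according to $N(0, \Phi)$. However, 

$$B = \sum_{\alpha=1}^{n} Y_\alpha Y_\alpha' = C^{-1} \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' (C')^{-1} = C^{-1}A(C')^{-1} \tag{14}$$

is distributed according to $W(\Phi, n)$. Finally, $|\partial(A)/\partial(B)|$, the Jacobian of 
the transformation (12), is 

$$\left|\frac{\partial(A)}{\partial(B)}\right| = \frac{w(B \mid \Phi, n)}{w(A \mid \Sigma, n)} = \frac{|B|^{\frac{1}{2}(n-p-1)} |\Sigma|^{\frac{1}{2}n}}{|A|^{\frac{1}{2}(n-p-1)} |\Phi|^{\frac{1}{2}n}} = \operatorname{mod} |C|^{p+1}. \tag{15}$$


Theorem 7.3.3. The Jacobian of the transformation (12) from $A$ to $B$, where 
$A$ and $B$ are symmetric, is $\operatorname{mod} |C|^{p+1}$. 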
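
The density ratio in (15) can be evaluated numerically for a particular case. The sketch below assumes $p = 2$ and hypothetical matrices `A`, `SIGMA`, `C` with $n = 7$; the function `log_wishart_density` is an illustrative transcription of the density (14) of Section 7.2, not a library routine.

```python
import math

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def mul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]

def log_wishart_density(A, Sigma, n):
    """log w(A | Sigma, n) for p = 2, following the Wishart density formula."""
    p = 2
    SA = mul2(inv2(Sigma), A)
    log_num = 0.5 * (n - p - 1) * math.log(det2(A)) - 0.5 * (SA[0][0] + SA[1][1])
    log_den = (0.5 * p * n * math.log(2.0)
               + 0.25 * p * (p - 1) * math.log(math.pi)
               + 0.5 * n * math.log(det2(Sigma))
               + sum(math.lgamma(0.5 * (n + 1 - i)) for i in range(1, p + 1)))
    return log_num - log_den

# Hypothetical illustrative matrices (not from the text).
A = [[5.0, 1.0], [1.0, 3.0]]
SIGMA = [[2.0, 0.5], [0.5, 1.0]]
C = [[1.0, 2.0], [0.5, -1.0]]          # det C = -2
N_DF = 7

C_inv = inv2(C)
B = mul2(mul2(C_inv, A), transpose2(C_inv))         # B = C^{-1} A (C')^{-1}
PHI = mul2(mul2(C_inv, SIGMA), transpose2(C_inv))   # Phi from (13)

ratio = math.exp(log_wishart_density(B, PHI, N_DF)
                 - log_wishart_density(A, SIGMA, N_DF))
print(ratio, abs(det2(C)) ** 3)        # both should equal mod|C|^(p+1)
```

Working on the log scale avoids overflow in the gamma functions for larger $n$; the ratio does not depend on $n$, as (15) indicates.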


7.3.4. Marginal Distributions 


If $A$ is distributed according to $W(\Sigma, n)$, the marginal distribution of any 
arbitrary set of the elements of $A$ may be awkward to obtain. However, the 
marginal distribution of some sets of elements can be found easily. We give 
some of these in the following two theorems. 

Theorem 7.3.4. Let $A$ and $\Sigma$ be partitioned into $q$ and $p - q$ rows and 
columns, 

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \tag{16}$$

If $A$ is distributed according to $W(\Sigma, n)$, then $A_{11}$ is distributed according to 
$W(\Sigma_{11}, n)$. 


Proof. $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where the $Z_\alpha$ are independent, each 
with the distribution $N(0, \Sigma)$. Partition $Z_\alpha$ into subvectors of $q$ and $p - q$ 
components, $Z_\alpha = (Z_\alpha^{(1)\prime}, Z_\alpha^{(2)\prime})'$. Then $Z_1^{(1)}, \dots, Z_n^{(1)}$ are independent, each 
with the distribution $N(0, \Sigma_{11})$, and $A_{11}$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha^{(1)} Z_\alpha^{(1)\prime}$, which 
has the distribution $W(\Sigma_{11}, n)$. ∎ 


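Theorem 7.3.5. Let $A$ and $\Sigma$ be partitioned into $p_1, p_2, \dots, p_q$ rows and 
columns ($p_1 + \cdots + p_q = p$), 

$$A = \begin{pmatrix} A_{11} & \cdots & A_{1q} \\ \vdots & & \vdots \\ A_{q1} & \cdots & A_{qq} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1q} \\ \vdots & & \vdots \\ \Sigma_{q1} & \cdots & \Sigma_{qq} \end{pmatrix}. \tag{17}$$

If $\Sigma_{ij} = 0$ for $i \ne j$ and if $A$ is distributed according to $W(\Sigma, n)$, then 
$A_{11}, A_{22}, \dots, A_{qq}$ are independently distributed and $A_{jj}$ is distributed according to 
$W(\Sigma_{jj}, n)$. 


Proof. $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $Z_1, \dots, Z_n$ are independently 
distributed, each according to $N(0, \Sigma)$. Let $Z_\alpha$ be partitioned 

$$Z_\alpha = \begin{pmatrix} Z_\alpha^{(1)} \\ \vdots \\ Z_\alpha^{(q)} \end{pmatrix} \tag{18}$$

as $A$ and $\Sigma$ have been partitioned. Since $\Sigma_{ij} = 0$, the sets $Z_1^{(1)}, \dots, 
Z_n^{(1)}, \dots, Z_1^{(q)}, \dots, Z_n^{(q)}$ are independent. Then $A_{11} = \sum_{\alpha=1}^{n} Z_\alpha^{(1)} Z_\alpha^{(1)\prime}, \dots, A_{qq} = 
\sum_{\alpha=1}^{n} Z_\alpha^{(q)} Z_\alpha^{(q)\prime}$ are independent. The rest of Theorem 7.3.5 follows from 
Theorem 7.3.4. ∎ 

7.3.5. Conditional Distributions 


In Section 4.3 we considered estimation of the parameters of the conditional 
distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$. Application of Theorem 7.2.2 to Theorem 
4.3.3 yields the following theorem: 


Theorem 7.3.6. Let $A$ and $\Sigma$ be partitioned into $q$ and $p - q$ rows and 
columns as in (16). If $A$ is distributed according to $W(\Sigma, n)$, the distribution of 
$A_{11 \cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ is $W(\Sigma_{11 \cdot 2}, n - p + q)$, $n \ge p - q$. 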
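
In the smallest case $p = 2$, $q = 1$, Theorem 7.3.6 says that the scalar $a_{11} - a_{12}^2/a_{22}$ is $\sigma_{11 \cdot 2}$ times a $\chi^2$ variable with $n - 1$ degrees of freedom. The simulation sketch below, with a hypothetical `SIGMA`, $n = 5$ and a fixed seed, compares the sample mean of that quantity with $(n - 1)\sigma_{11 \cdot 2}$ and also reports its sample correlation with $a_{22}$, which should be near zero.

```python
import math
import random

def simulate_schur(Sigma, n, reps, seed=0):
    """Draw A ~ W(Sigma, n) for p = 2 and return lists of A_11.2 and A_22."""
    rng = random.Random(seed)
    # Cholesky factor of Sigma, used to draw Z ~ N(0, Sigma).
    l11 = math.sqrt(Sigma[0][0])
    l21 = Sigma[0][1] / l11
    l22 = math.sqrt(Sigma[1][1] - l21 * l21)
    schur, a22_values = [], []
    for _ in range(reps):
        a11 = a12 = a22 = 0.0
        for _ in range(n):
            e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            z1, z2 = l11 * e1, l21 * e1 + l22 * e2
            a11 += z1 * z1
            a12 += z1 * z2
            a22 += z2 * z2
        schur.append(a11 - a12 * a12 / a22)   # A_11.2 = A11 - A12 A22^{-1} A21
        a22_values.append(a22)
    return schur, a22_values

def correlation(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative values (not from the text).
SIGMA = [[2.0, 0.8], [0.8, 1.0]]
N_DF = 5
sigma_11_2 = SIGMA[0][0] - SIGMA[0][1] ** 2 / SIGMA[1][1]

schur, a22_values = simulate_schur(SIGMA, N_DF, reps=20000, seed=3)
mean_schur = sum(schur) / len(schur)
print(mean_schur, (N_DF - 1) * sigma_11_2, correlation(schur, a22_values))
```

Matching the mean and finding a small correlation are only partial checks; they do not by themselves establish the Wishart form or full independence.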


Note that Theorem 7.3.6 implies that $A_{11 \cdot 2}$ is independent of $A_{22}$ and 
$A_{12}A_{22}^{-1}$ regardless of $\Sigma$. 


7.4. COCHRAN'S THEOREM 


Cochran's theorem [Cochran (1934)] is useful in proving that certain vector 
quadratic forms are distributed as sums of vector squares. It is a statistical 
statement of an algebraic theorem, which we shall give as a lemma. 


Lemma 7.4.1. If the $N \times N$ symmetric matrix $C_i$ has rank $r_i$, $i = 1, \dots, m$, 
and 

$$\sum_{i=1}^{m} C_i = I_N, \tag{1}$$

then 

$$\sum_{i=1}^{m} r_i = N \tag{2}$$

is a necessary and sufficient condition for there to exist an $N \times N$ orthogonal 
matrix $P$ such that for $i = 1, \dots, m$ 

$$PC_iP' = \begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{pmatrix}, \tag{3}$$

where $I$ is of order $r_i$, the upper left-hand 0 is square of order $\sum_{j=1}^{i-1} r_j$ (which is 
vacuous for $i = 1$), and the lower right-hand 0 is square of order $\sum_{j=i+1}^{m} r_j$ (which 
is vacuous for $i = m$). 


Proof. The necessity follows from the fact that (1) implies that the sum of 
(3) over $i = 1, \dots, m$ is $I_N$. Now let us prove the sufficiency; we assume (2). 
There exists an orthogonal matrix $P_i$ such that $P_iC_iP_i'$ is diagonal with 
diagonal elements the characteristic roots of $C_i$. The number of nonzero 
roots is $r_i$, the rank of $C_i$, and the number of 0 roots is $N - r_i$. We write 

$$P_iC_iP_i' = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \Delta_i & 0 \\ 0 & 0 & 0 \end{pmatrix}, \tag{4}$$

where the partitioning is according to (3), and $\Delta_i$ is diagonal of order $r_i$. This 
is possible in view of (2). Then 

$$\sum_{j \ne i} P_iC_jP_i' = I - P_iC_iP_i' = \begin{pmatrix} I & 0 & 0 \\ 0 & I - \Delta_i & 0 \\ 0 & 0 & I \end{pmatrix}. \tag{5}$$

Since the rank of (5) is not greater than $\sum_{j \ne i} r_j = N - r_i$, which is the sum 
of the orders of the upper left-hand and lower right-hand $I$'s in (5), the rank 
of $I - \Delta_i$ is 0 and $\Delta_i = I$. (Thus the $r_i$ nonzero roots of $C_i$ are 1, and $C_i$ is 
positive semidefinite.) From (4) we obtain 

$$C_i = P_i' \begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{pmatrix} P_i = B_iB_i', \tag{6}$$

where $B_i$ consists of the $r_i$ columns of $P_i'$ corresponding to $I$ in (6). From (1) 
we obtain 

$$I = \sum_{i=1}^{m} B_iB_i' = (B_1, B_2, \dots, B_m) \begin{pmatrix} B_1' \\ B_2' \\ \vdots \\ B_m' \end{pmatrix} = P'P, \tag{7}$$

where $P = (B_1, B_2, \dots, B_m)'$. ∎ 

We now state а multivariate analog to Cochran's theorem. 


Theorem 7.4.1. Suppose Y;,..., Y, are independently distributed, each ac- 
cording to №0, X). Suppose the matrix (сі в) = C, used in forming 


N 
(8) Q= У св, | i-1,...,m 
а. B-1 , , 
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is of rank r,, and suppose 
m N 

(9) У 0,= LY. 
і= 1 а=1 


Then (2) is а necessary and sufficient condition for Q,..., Ом to be indepen- 
dently distributed with О; having the distribution W(X, r,). 


It follows from (3) that C; is idempotent. See Section A.2 of the Appendix. 

This theorem is useful in generalizing results from the univariate analysis
of variance. (See Chapter 8.) As an example of the use of this theorem, let us
prove that the mean of a sample of size $N$ times its transpose and a multiple
of the sample covariance matrix are independently distributed with a singular
and a nonsingular Wishart distribution, respectively. Let $Y_1, \ldots, Y_N$ be
independently distributed, each according to $N(0, \Sigma)$. We shall use the matrices
$C_1 = (c^{(1)}_{\alpha\beta}) = (1/N)$ and $C_2 = (c^{(2)}_{\alpha\beta}) = [\delta_{\alpha\beta} - (1/N)]$. Then

(10)    $Q_1 = \sum_{\alpha,\beta=1}^{N} \frac{1}{N} Y_\alpha Y_\beta' = N \bar{Y} \bar{Y}',$

(11)    $Q_2 = \sum_{\alpha,\beta=1}^{N} \left(\delta_{\alpha\beta} - \frac{1}{N}\right) Y_\alpha Y_\beta' = \sum_{\alpha=1}^{N} Y_\alpha Y_\alpha' - N \bar{Y} \bar{Y}',$

and (9) is satisfied. The matrix $C_1$ is of rank 1; the matrix $C_2$ is of rank $N - 1$
(since the rank of the sum of two matrices is less than or equal to the sum of
the ranks of the matrices and the rank of the second matrix is less than $N$).
The conditions of the theorem are satisfied; therefore $Q_1$ is distributed as
$ZZ'$, where $Z$ is distributed according to $N(0, \Sigma)$, and $Q_2$ is distributed
independently according to $W(\Sigma, N - 1)$.
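The decomposition in (10) and (11) can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the seed, and the sample dimensions (N = 6, p = 2) are assumptions chosen for illustration. It forms $Q_1$ and $Q_2$ directly from $C_1$ and $C_2$ and uses the fact that the trace of an idempotent matrix equals its rank.

```python
import random

def quadratic_form(c, ys):
    """Q = sum_{a,b} c[a][b] * y_a y_b' for a list ys of p-component vectors."""
    n, p = len(ys), len(ys[0])
    return [[sum(c[a][b] * ys[a][i] * ys[b][j] for a in range(n) for b in range(n))
             for j in range(p)] for i in range(p)]

def cochran_matrices(n):
    """C1 = (1/N) and C2 = (delta_ab - 1/N), the matrices used in (10) and (11)."""
    c1 = [[1.0 / n] * n for _ in range(n)]
    c2 = [[(1.0 if a == b else 0.0) - 1.0 / n for b in range(n)] for a in range(n)]
    return c1, c2

random.seed(0)  # arbitrary seed (assumption); any data would do
ys = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(6)]
c1, c2 = cochran_matrices(len(ys))
q1, q2 = quadratic_form(c1, ys), quadratic_form(c2, ys)
total = [[sum(y[i] * y[j] for y in ys) for j in range(2)] for i in range(2)]
rank_c1 = sum(c1[a][a] for a in range(len(ys)))  # trace = rank for an idempotent matrix
rank_c2 = sum(c2[a][a] for a in range(len(ys)))
```

Here `q1 + q2` reproduces `total`, which is condition (9), and the two traces are 1 and N - 1, so the rank condition (2) holds.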


Anderson and Styan (1982) have given a survey of proofs and extensions of
Cochran's theorem.


7.5. THE GENERALIZED VARIANCE 


7.5.1. Definition of the Generalized Variance 


One multivariate analog of the variance $\sigma^2$ of a univariate distribution is the
covariance matrix $\Sigma$. Another multivariate analog is the scalar $|\Sigma|$, which is
called the generalized variance of the multivariate distribution [Wilks (1932);
see also Frisch (1929)]. Similarly, the generalized variance of the sample of
vectors $x_1, \ldots, x_N$ is

(1)    $|S| = \left| \frac{1}{N-1} \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' \right|.$

In some sense each of these is a measure of spread. We consider them here
because the sample generalized variance will recur in many likelihood ratio
criteria for testing hypotheses.
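As a small illustration of definition (1), the sketch below computes the sample generalized variance for a toy data set. The helper names and the data are assumptions for illustration only; the determinant uses cofactor expansion, which is adequate for small $p$.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def sample_generalized_variance(xs):
    """|S| with S = (1/(N-1)) * sum (x_a - xbar)(x_a - xbar)'."""
    n, p = len(xs), len(xs[0])
    xbar = [sum(x[i] for x in xs) / n for i in range(p)]
    s = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in xs) / (n - 1)
          for j in range(p)] for i in range(p)]
    return det(s)

# Four points at the corners of a square: S = (4/3) I, so |S| = 16/9.
gv = sample_generalized_variance([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
```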

A geometric interpretation of the sample generalized variance comes from
considering the $p$ rows of $X = (x_1, \ldots, x_N)$ as $p$ vectors in $N$-dimensional
space. In Section 3.2 it was shown that the rows of

(2)    $(x_1 - \bar{x}, \ldots, x_N - \bar{x}) = X - \bar{x}\varepsilon',$

where $\varepsilon = (1, \ldots, 1)'$, are orthogonal to the equiangular line (through the
origin and $\varepsilon$); see Figure 3.2. Then the entries of

(3)    $A = (X - \bar{x}\varepsilon')(X - \bar{x}\varepsilon')'$

are the inner products of rows of $X - \bar{x}\varepsilon'$.

We now define a parallelotope determined by $p$ vectors $v_1, \ldots, v_p$ in an
$n$-dimensional space ($n \ge p$). If $p = 1$, the parallelotope is the line segment
$v_1$. If $p = 2$, the parallelotope is the parallelogram with $v_1$ and $v_2$ as principal
edges; that is, its sides are $v_1$, $v_2$, $v_1$ translated so its initial endpoint is at $v_2$,
and $v_2$ translated so its initial endpoint is at $v_1$. See Figure 7.2. If $p = 3$, the
parallelotope is the conventional parallelepiped with $v_1$, $v_2$, and $v_3$ as
principal edges. In general, the parallelotope is the figure defined by the
principal edges $v_1, \ldots, v_p$. It is cut out by $p$ pairs of parallel $(p-1)$-dimensional
hyperplanes, one hyperplane of a pair being spanned by $p - 1$ of
$v_1, \ldots, v_p$ and the other hyperplane going through the endpoint of the
remaining vector.


Figure 7.2. A parallelogram.


Theorem 7.5.1. If $V = (v_1, \ldots, v_p)$, then the square of the $p$-dimensional
volume of the parallelotope with $v_1, \ldots, v_p$ as principal edges is $|V'V|$.


Proof. If $p = 1$, then $|V'V| = v_1'v_1 = \|v_1\|^2$, which is the square of the
one-dimensional volume of $v_1$. If two $k$-dimensional parallelotopes have
bases consisting of $(k-1)$-dimensional parallelotopes of equal $(k-1)$-dimensional
volumes and equal altitudes, their $k$-dimensional volumes are
equal [since the $k$-dimensional volume is the integral of the $(k-1)$-dimensional
volumes]. In particular, the volume of a $k$-dimensional parallelotope
is equal to the volume of a parallelotope with the same base (in $k-1$
dimensions) and same altitude with sides in the $k$th direction orthogonal to
the first $k - 1$ directions. Thus the volume of the parallelotope with principal
edges $v_1, \ldots, v_k$, say $P_k$, is equal to the volume of the parallelotope with
principal edges $v_1, \ldots, v_{k-1}$, say $P_{k-1}$, times the altitude of $P_k$ over $P_{k-1}$;
that is,

(4)    $\mathrm{Vol}(P_k) = \mathrm{Vol}(P_{k-1}) \times \mathrm{Alt}(P_k \mid P_{k-1}).$

It follows (by induction) that

(5)    $\mathrm{Vol}(P_p) = \mathrm{Vol}(P_1) \times \mathrm{Alt}(P_2 \mid P_1) \times \cdots \times \mathrm{Alt}(P_p \mid P_{p-1}).$

By the construction in Section 7.2 the altitude of $P_k$ over $P_{k-1}$ is $t_{kk} = \|w_k\|$;
that is, $t_{kk}$ is the distance of $v_k$ from the $(k-1)$-dimensional space spanned
by $v_1, \ldots, v_{k-1}$ (or $w_1, \ldots, w_{k-1}$). Hence $\mathrm{Vol}(P_p) = \prod_{k=1}^{p} t_{kk}$. Since $|V'V| =
|TT'| = \prod_{i=1}^{p} t_{ii}^2$, the theorem is proved. ∎
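The argument of the proof can be mirrored numerically: the product of the successive altitudes, obtained by Gram-Schmidt orthogonalization, squared, should equal the Gram determinant $|V'V|$. The sketch below is illustrative only; the function names and the three edge vectors in 4-space are assumptions, and the edges are assumed linearly independent.

```python
import math

def det(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def gram(vs):
    """Matrix of inner products v_i'v_j, i.e. V'V for V = (v_1, ..., v_p)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vs] for u in vs]

def altitudes(vs):
    """Altitudes t_kk = ||w_k||, where w_k is v_k minus its projection on v_1..v_{k-1}."""
    ws, alts = [], []
    for v in vs:
        w = list(v)
        for u in ws:  # the earlier w's are mutually orthogonal
            coef = sum(a * b for a, b in zip(v, u)) / sum(a * a for a in u)
            w = [wi - coef * ui for wi, ui in zip(w, u)]
        ws.append(w)
        alts.append(math.sqrt(sum(a * a for a in w)))
    return alts

edges = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, -1.0], [2.0, 0.0, 1.0, 1.0]]
volume_squared = math.prod(altitudes(edges)) ** 2
gram_determinant = det(gram(edges))
```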


We now apply this theorem to the parallelotope having the rows of (2) 
as principal edges. The dimensionality in Theorem 7.5.1 is arbitrary (but at 
least p). 


Corollary 7.5.1. The square of the $p$-dimensional volume of the parallelotope
with the rows of (2) as principal edges is $|A|$, where $A$ is given by (3).


We shall see later that many multivariate statistics can be given an
interpretation in terms of these volumes. These volumes are analogous to
distances that arise in special cases when $p = 1$.

We now consider a geometric interpretation of $|A|$ in terms of $N$ points in
$p$-space. Let the columns of the matrix (2) be $y_1, \ldots, y_N$, representing $N$
points in $p$-space. When $p = 1$, $|A| = \sum_\alpha y_{1\alpha}^2$, which is the sum of squares of
the distances from the points to the origin. In general $|A|$ is the sum of
squares of the volumes of all parallelotopes formed by taking as principal
edges $p$ vectors from the set $y_1, \ldots, y_N$.

We see that

(6)    $|A| = \begin{vmatrix} \sum_\alpha y_{1\alpha}^2 & \cdots & \sum_\alpha y_{1\alpha} y_{p-1,\alpha} & \sum_\beta y_{1\beta} y_{p\beta} \\ \vdots & & \vdots & \vdots \\ \sum_\alpha y_{p\alpha} y_{1\alpha} & \cdots & \sum_\alpha y_{p\alpha} y_{p-1,\alpha} & \sum_\beta y_{p\beta}^2 \end{vmatrix}$

       $= \sum_\beta \begin{vmatrix} \sum_\alpha y_{1\alpha}^2 & \cdots & \sum_\alpha y_{1\alpha} y_{p-1,\alpha} & y_{1\beta} y_{p\beta} \\ \vdots & & \vdots & \vdots \\ \sum_\alpha y_{p\alpha} y_{1\alpha} & \cdots & \sum_\alpha y_{p\alpha} y_{p-1,\alpha} & y_{p\beta}^2 \end{vmatrix}$

by the rule for expanding determinants. [See (24) of Section A.1 of the
Appendix.] In (6) the matrix $A$ has been partitioned into $p - 1$ and 1
columns. Applying the rule successively to the columns, we find

(7)    $|A| = \sum_{\alpha_1, \ldots, \alpha_p = 1}^{N} \left| y_{i\alpha_j} y_{j\alpha_j} \right|.$

By Theorem 7.5.1 the square of the volume of the parallelotope with
$y_{\gamma_1}, \ldots, y_{\gamma_p}$, $\gamma_1 < \cdots < \gamma_p$, as principal edges is

(8)    $V^2_{\gamma_1, \ldots, \gamma_p} = \left| \sum_\beta y_{i\beta} y_{j\beta} \right|,$

where the sum on $\beta$ is over $(\gamma_1, \ldots, \gamma_p)$. If we now expand this determinant
in the manner used for $|A|$, we obtain

(9)    $V^2_{\gamma_1, \ldots, \gamma_p} = \sum \left| y_{i\beta_j} y_{j\beta_j} \right|,$

where the sum is for each $\beta_j$ over the range $(\gamma_1, \ldots, \gamma_p)$. Summing (9) over all
different sets $(\gamma_1 < \cdots < \gamma_p)$, we obtain (7). ($|y_{i\beta_j} y_{j\beta_j}| = 0$ if two or more $\beta_j$ are
equal.) Thus $|A|$ is the sum of volumes squared of all different parallelotopes
formed by sets of $p$ of the vectors $y_\alpha$ as principal edges. If we replace $y_\alpha$ by
$x_\alpha - \bar{x}$, we can state the following theorem:


Theorem 7.5.2. Let $|S|$ be defined by (1), where $x_1, \ldots, x_N$ are the $N$
vectors of a sample. Then $|S|$ is proportional to the sum of squares of the
volumes of all the different parallelotopes formed by using as principal edges $p$
vectors with $p$ of $x_1, \ldots, x_N$ as one set of endpoints and $\bar{x}$ as the other, and the
factor of proportionality is $1/(N - 1)^p$.
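The identity behind this theorem (a case of the Cauchy-Binet formula) is easy to verify on a small example. The sketch below is illustrative and rests on assumed toy data: four deviation vectors in the plane that sum to zero, as deviations from a sample mean do. The function names are not from the text.

```python
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def sum_of_squared_volumes(ys):
    """Sum over all p-subsets of the points of the squared parallelotope volume."""
    p = len(ys[0])
    return sum(det([list(y) for y in subset]) ** 2 for subset in combinations(ys, p))

def scatter(ys):
    """A = sum_a y_a y_a'."""
    p = len(ys[0])
    return [[sum(y[i] * y[j] for y in ys) for j in range(p)] for i in range(p)]

deviations = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [0.0, -3.0]]  # columns y_a of (2)
det_a = det(scatter(deviations))
volumes = sum_of_squared_volumes(deviations)
gen_var = det_a / (len(deviations) - 1) ** 2  # |S| = |A| / (N - 1)^p with p = 2
```

For these points both `det_a` and `volumes` equal 27, and $|S| = 27/3^2 = 3$.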


The population analog of $|S|$ is $|\Sigma|$, which can also be given a geometric
interpretation. From Section 3.3 we know that

(10)    $\Pr\{X'\Sigma^{-1}X \le \chi_p^2(\alpha)\} = 1 - \alpha$

if $X$ is distributed according to $N(0, \Sigma)$; that is, the probability is $1 - \alpha$ that
$X$ fall inside the ellipsoid

(11)    $x'\Sigma^{-1}x = \chi_p^2(\alpha).$

The volume of this ellipsoid is $C(p)\,|\Sigma|^{1/2}\,[\chi_p^2(\alpha)]^{p/2}/p$, where $C(p)$ is defined
in Problem 7.3.
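A short sketch of the volume formula follows. It assumes $C(p) = 2\pi^{p/2}/\Gamma(p/2)$, the surface area of the unit sphere in $p$ dimensions, and takes the constant $c = \chi_p^2(\alpha)$ as an input rather than computing a chi-square quantile. The function names are illustrative.

```python
import math

def unit_sphere_area(p):
    """C(p) = 2 pi^{p/2} / Gamma(p/2), assumed here as the constant of Problem 7.3."""
    return 2.0 * math.pi ** (p / 2.0) / math.gamma(p / 2.0)

def ellipsoid_volume(det_sigma, p, c):
    """Volume of {x : x' Sigma^{-1} x <= c}, namely C(p) |Sigma|^{1/2} c^{p/2} / p."""
    return unit_sphere_area(p) * math.sqrt(det_sigma) * c ** (p / 2.0) / p

unit_disc = ellipsoid_volume(1.0, 2, 1.0)   # area of the unit disc
unit_ball = ellipsoid_volume(1.0, 3, 1.0)   # volume of the unit ball
ellipse = ellipsoid_volume(36.0, 2, 1.0)    # Sigma = diag(4, 9): semi-axes 2 and 3
```

The volume scales with $|\Sigma|^{1/2}$, which is the sense in which the generalized variance measures spread.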


7.5.2. Distribution of the Sample Generalized Variance


The distribution of $|S|$ is the same as the distribution of $|A|/(N - 1)^p$, where
$A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ and $Z_1, \ldots, Z_n$ are distributed independently, each according
to $N(0, \Sigma)$, and $n = N - 1$. Let $Z_\alpha = CY_\alpha$, $\alpha = 1, \ldots, n$, where $CC' = \Sigma$. Then
$Y_1, \ldots, Y_n$ are independently distributed, each with distribution $N(0, I)$. Let

(12)    $B = \sum_{\alpha=1}^{n} Y_\alpha Y_\alpha' = \sum_{\alpha=1}^{n} C^{-1} Z_\alpha Z_\alpha' (C^{-1})' = C^{-1} A (C^{-1})';$

then $|A| = |C| \cdot |B| \cdot |C'| = |B| \cdot |\Sigma|$. By the development in Section 7.2 we
see that $|B|$ has the distribution of $\prod_{i=1}^{p} t_{ii}^2$ and that $t_{11}^2, \ldots, t_{pp}^2$ are independently
distributed with $\chi^2$-distributions.


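Theorem 7.5.3. The distribution of the generalized variance $|S|$ of a sample
$X_1, \ldots, X_N$ from $N(\mu, \Sigma)$ is the same as the distribution of $|\Sigma|/(N - 1)^p$ times
the product of $p$ independent factors, the distribution of the $i$th factor being the
$\chi^2$-distribution with $N - i$ degrees of freedom.

If $p = 1$, $|S|$ has the distribution of $|\Sigma| \cdot \chi^2_{N-1}/(N - 1)$. If $p = 2$, $|S|$ has
the distribution of $|\Sigma|\,\chi^2_{N-1} \cdot \chi^2_{N-2}/(N - 1)^2$. It follows from Problem 7.15 or
7.37 that when $p = 2$, $|S|$ has the distribution of $|\Sigma|(\chi^2_{2N-4})^2/(2N - 2)^2$. We
can write

(13)    $|A| = |\Sigma| \times \chi^2_{N-1} \times \chi^2_{N-2} \times \cdots \times \chi^2_{N-p}.$

If $p = 2r$, then $|A|$ is distributed as

(14)    $|\Sigma| \, (\chi^2_{2N-4})^2 \times (\chi^2_{2N-8})^2 \times \cdots \times (\chi^2_{2N-4r})^2 / 2^{2r}.$

Since the $h$th moment of a $\chi^2$-variable with $m$ degrees of freedom is
$2^h \Gamma(\tfrac{1}{2}m + h)/\Gamma(\tfrac{1}{2}m)$ and the moment of a product of independent variables
is the product of the moments of the variables, the $h$th moment of $|A|$ is

(15)    $|\Sigma|^h \prod_{i=1}^{p} \left\{ 2^h \frac{\Gamma[\tfrac{1}{2}(N - i) + h]}{\Gamma[\tfrac{1}{2}(N - i)]} \right\} = 2^{hp} |\Sigma|^h \frac{\prod_{i=1}^{p} \Gamma[\tfrac{1}{2}(N - i) + h]}{\prod_{i=1}^{p} \Gamma[\tfrac{1}{2}(N - i)]} = 2^{hp} |\Sigma|^h \frac{\Gamma_p[\tfrac{1}{2}(N - 1) + h]}{\Gamma_p[\tfrac{1}{2}(N - 1)]}.$

Thus

(16)    $\mathcal{E}|A| = |\Sigma| \prod_{i=1}^{p} (N - i),$

(17)    $\mathcal{V}(|A|) = |\Sigma|^2 \prod_{i=1}^{p} (N - i) \left[ \prod_{j=1}^{p} (N - j + 2) - \prod_{j=1}^{p} (N - j) \right],$

where $\mathcal{V}(|A|)$ is the variance of $|A|$.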
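The moment formula (15) and the closed forms (16) and (17) can be cross-checked numerically. This is a minimal sketch with assumed function names; it works on the log scale with `math.lgamma` to avoid overflow and requires $N > p$.

```python
import math

def moment_abs_a(h, big_n, p, det_sigma=1.0):
    """h-th moment of |A| from (15): |Sigma|^h prod_i 2^h Gamma((N-i)/2 + h) / Gamma((N-i)/2)."""
    log_m = h * math.log(det_sigma)
    for i in range(1, p + 1):
        half = 0.5 * (big_n - i)
        log_m += h * math.log(2.0) + math.lgamma(half + h) - math.lgamma(half)
    return math.exp(log_m)

def mean_and_variance(big_n, p, det_sigma=1.0):
    """Closed forms (16) and (17) for the mean and variance of |A|."""
    mean = det_sigma * math.prod(big_n - i for i in range(1, p + 1))
    bracket = (math.prod(big_n - j + 2 for j in range(1, p + 1))
               - math.prod(big_n - j for j in range(1, p + 1)))
    return mean, det_sigma * mean * bracket

mean, variance = mean_and_variance(10, 3)
first = moment_abs_a(1, 10, 3)
second = moment_abs_a(2, 10, 3)
```

With $N = 10$, $p = 3$, and $|\Sigma| = 1$ the mean is $9 \cdot 8 \cdot 7 = 504$, and `second - first**2` agrees with the variance from (17).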


7.5.3. The Asymptotic Distribution of the Sample Generalized Variance


Let $|B|/n^p = V_1(n) \times V_2(n) \times \cdots \times V_p(n)$, where the $V$'s are independently
distributed and $nV_i(n) = \chi^2_{n-p+i}$. Since $\chi^2_{n-p+i}$ is distributed as $\sum_{\alpha=1}^{n-p+i} W_\alpha^2$,
where the $W_\alpha$ are independent, each with distribution $N(0, 1)$, the central
limit theorem (applied to $W_\alpha^2$) states that

(18)    $\frac{nV_i(n) - (n - p + i)}{\sqrt{2(n - p + i)}} = \frac{\sqrt{n}\left[V_i(n) - 1 + \frac{p - i}{n}\right]}{\sqrt{2}\,\sqrt{1 - \frac{p - i}{n}}}$

is asymptotically distributed according to $N(0, 1)$. Then $\sqrt{n}\,[V_i(n) - 1]$ is
asymptotically distributed according to $N(0, 2)$. We now apply Theorem 4.2.3.
We have

(19)    $U(n) = \begin{pmatrix} V_1(n) \\ \vdots \\ V_p(n) \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix},$

$|B|/n^p = w = f(u_1, \ldots, u_p) = u_1 u_2 \cdots u_p$, $T = 2I$, $\partial f/\partial u_i|_{u=b} = 1$, and $\phi_b' T \phi_b
= 2p$. Thus

(20)    $\sqrt{n}\left(\frac{|B|}{n^p} - 1\right)$

is asymptotically distributed according to $N(0, 2p)$.


Theorem 7.5.4. Let $S$ be a $p \times p$ sample covariance matrix with $n$ degrees of
freedom. Then $\sqrt{n}\,(|S|/|\Sigma| - 1)$ is asymptotically normally distributed with
mean 0 and variance $2p$.
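The limiting variance $2p$ can be compared with the exact finite-sample variance implied by the chi-square moments. The sketch below is an illustration under the assumption $\Sigma = I$; it uses exact rational arithmetic, so no simulation is involved. The function name is hypothetical.

```python
from fractions import Fraction

def scaled_variance(n, p):
    """Exact n * Var(|B| / n^p) for B ~ W(I, n), using E chi2_m = m and E chi2_m^2 = m(m + 2)."""
    mean = Fraction(1)
    second = Fraction(1)
    for i in range(1, p + 1):
        m = n + 1 - i  # degrees of freedom of the i-th chi-square factor
        mean *= Fraction(m, n)
        second *= Fraction(m * (m + 2), n * n)
    return n * (second - mean ** 2)

exact_p2 = scaled_variance(100, 2)             # equals (n - 1)(4n + 2) / n^2 at n = 100
near_limit = float(scaled_variance(10**6, 3))  # close to 2p = 6
```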


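7.6. DISTRIBUTION OF THE SET OF CORRELATION COEFFICIENTS
WHEN THE POPULATION COVARIANCE MATRIX IS DIAGONAL


In Section 4.2.1 we found the distribution of a single sample correlation when
the corresponding population correlation was zero. Here we shall find the
density of the set $r_{ij}$, $i < j$, $i, j = 1, \ldots, p$, when $\rho_{ij} = 0$, $i < j$.

We start with the distribution of $A$ when $\Sigma$ is diagonal; that is,
$W[(\sigma_{ii}\delta_{ij}), n]$. The density of $A$ is

(1)    $\frac{|a_{ij}|^{\frac{1}{2}(n-p-1)} \exp\left(-\tfrac{1}{2}\sum_{i=1}^{p} a_{ii}/\sigma_{ii}\right)}{2^{\frac{1}{2}np} \prod_{i=1}^{p} \sigma_{ii}^{\frac{1}{2}n} \, \Gamma_p(\tfrac{1}{2}n)},$

since

(2)    $|\Sigma| = \begin{vmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{vmatrix} = \prod_{i=1}^{p} \sigma_{ii}.$

We make the transformation

(3)    $a_{ij} = \sqrt{a_{ii}}\sqrt{a_{jj}}\, r_{ij}, \qquad i \ne j,$

(4)    $a_{ii} = a_{ii}.$

The Jacobian is the product of the Jacobian of (4) and that of (3) for $a_{ii}$
fixed. The Jacobian of (3) is the determinant of a $p(p - 1)/2$-order diagonal
matrix with diagonal elements $\sqrt{a_{ii}}\sqrt{a_{jj}}$. Since each particular subscript $k$,
say, appears in the set $r_{ij}$ ($i < j$) $p - 1$ times, the Jacobian is

(5)    $J = \prod_{i=1}^{p} a_{ii}^{\frac{1}{2}(p-1)}.$

If we substitute from (3) and (4) into $w[A \mid (\sigma_{ii}\delta_{ij}), n]$ and multiply by (5), we
obtain as the joint density of $\{a_{ii}\}$ and $\{r_{ij}\}$

(6)    $\frac{|r_{ij}|^{\frac{1}{2}(n-p-1)} \prod_{i=1}^{p} a_{ii}^{\frac{1}{2}(n-p-1)} \exp\left(-\tfrac{1}{2}\sum_{i=1}^{p} a_{ii}/\sigma_{ii}\right)}{2^{\frac{1}{2}np} \prod_{i=1}^{p} \sigma_{ii}^{\frac{1}{2}n} \, \Gamma_p(\tfrac{1}{2}n)} \prod_{i=1}^{p} a_{ii}^{\frac{1}{2}(p-1)}$

       $= \frac{|r_{ij}|^{\frac{1}{2}(n-p-1)}}{\Gamma_p(\tfrac{1}{2}n)} \prod_{i=1}^{p} \left\{ \frac{a_{ii}^{\frac{1}{2}n - 1} \exp\left(-\tfrac{1}{2} a_{ii}/\sigma_{ii}\right)}{2^{\frac{1}{2}n} \sigma_{ii}^{\frac{1}{2}n}} \right\},$

since

(7)    $|a_{ij}| = \left| \sqrt{a_{ii}}\sqrt{a_{jj}}\, r_{ij} \right| = \prod_{i=1}^{p} a_{ii} \cdot |r_{ij}|,$

where $r_{ii} = 1$. In the $i$th term of the product on the right-hand side of (6), let
$a_{ii}/(2\sigma_{ii}) = u_i$; then the integral of this term is

(8)    $\int_0^\infty \frac{a_{ii}^{\frac{1}{2}n - 1} \exp\left(-\tfrac{1}{2} a_{ii}/\sigma_{ii}\right)}{2^{\frac{1}{2}n} \sigma_{ii}^{\frac{1}{2}n}} \, da_{ii} = \int_0^\infty u_i^{\frac{1}{2}n - 1} e^{-u_i} \, du_i = \Gamma(\tfrac{1}{2}n)$

by definition of the gamma function (or by the fact that $a_{ii}/\sigma_{ii}$ has the
$\chi^2$-density with $n$ degrees of freedom). Hence the density of $r_{ij}$ is

(9)    $\frac{\Gamma^p(\tfrac{1}{2}n) \, |r_{ij}|^{\frac{1}{2}(n-p-1)}}{\Gamma_p(\tfrac{1}{2}n)}.$


Theorem 7.6.1. If $X_1, \ldots, X_N$ are independent, each with distribution
$N[\mu, (\sigma_{ii}\delta_{ij})]$, then the density of the sample correlation coefficients is given by
(9) where $n = N - 1$.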
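For $p = 2$ the density of the set of correlation coefficients reduces to a density in the single coefficient $r$, with $|r_{ij}| = 1 - r^2$. The sketch below evaluates that case and checks numerically that it integrates to 1. It assumes the usual definition $\Gamma_p(t) = \pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[t - \tfrac{1}{2}(i-1)]$; the function names and the midpoint-rule integrator are illustrative choices.

```python
import math

def log_multivariate_gamma(p, t):
    """log Gamma_p(t) = log[pi^{p(p-1)/4} prod_{i=1}^p Gamma(t - (i-1)/2)]."""
    return (p * (p - 1) / 4.0) * math.log(math.pi) + sum(
        math.lgamma(t - 0.5 * i) for i in range(p))

def correlation_density_p2(r, n):
    """Density of r for p = 2: Gamma(n/2)^2 (1 - r^2)^{(n-3)/2} / Gamma_2(n/2)."""
    log_const = 2.0 * math.lgamma(0.5 * n) - log_multivariate_gamma(2, 0.5 * n)
    return math.exp(log_const) * (1.0 - r * r) ** (0.5 * (n - 3))

def integrate(f, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

total_mass = integrate(lambda r: correlation_density_p2(r, 10), -1.0, 1.0)
# Constant of the single-correlation density of Section 4.2.1 (rho = 0), for comparison.
single_const = math.exp(math.lgamma(5.0) - math.lgamma(4.5)) / math.sqrt(math.pi)
```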


7.7. THE INVERTED WISHART DISTRIBUTION AND BAYES
ESTIMATION OF THE COVARIANCE MATRIX


7.7.1. The Inverted Wishart Distribution 


As indicated in Section 3.4.2, Bayes estimators are usually admissible. The
calculation of Bayes estimators is facilitated when the prior distribution of
the parameter is chosen conveniently. When there is a sufficient statistic,
there will exist a family of prior distributions for the parameter such that the
posterior distribution is a member of this family; such a family is called a
conjugate family of distributions. In Section 3.4.2 we saw that the normal
family of priors is conjugate to the normal family of distributions when the
covariance matrix is given. In this section we shall consider Bayesian estimation
of the covariance matrix and estimation of the mean vector and the
covariance matrix.


Theorem 7.7.1. If $A$ has the distribution $W(\Sigma, m)$, then $B = A^{-1}$ has the
density

(1)    $\frac{|\Psi|^{\frac{1}{2}m} \, |B|^{-\frac{1}{2}(m+p+1)} \, e^{-\frac{1}{2}\operatorname{tr} \Psi B^{-1}}}{2^{\frac{1}{2}mp} \, \Gamma_p(\tfrac{1}{2}m)}$

for $B$ positive definite and 0 elsewhere, where $\Psi = \Sigma^{-1}$.


Proof. By Theorem A.4.6 of the Appendix, the Jacobian of the transformation
$A = B^{-1}$ is $|B|^{-(p+1)}$. Substitution of $B^{-1}$ for $A$ in (16) of Section 7.2
and multiplication by $|B|^{-(p+1)}$ yields (1). ∎


We shall call (1) the density of the inverted Wishart distribution with $m$
degrees of freedom† and denote the distribution by $W^{-1}(\Psi, m)$ and the
density by $w^{-1}(B \mid \Psi, m)$. We shall call $\Psi$ the precision matrix or concentration
matrix.


7.7.2. Bayes Estimation of the Covariance Matrix 


The covariance matrix of a sample of size $N$ from $N(\mu, \Sigma)$ has the distribution
of $(1/n)A$, where $A$ has the distribution $W(\Sigma, n)$ and $n = N - 1$. We
shall now show that if $\Sigma$ is assigned an inverted Wishart distribution, then
the conditional distribution of $\Sigma$ given $A$ is an inverted Wishart distribution.
In other words, the family of inverted Wishart distributions for $\Sigma$ is conjugate
to the family of Wishart distributions.


† The definition of the number of degrees of freedom differs from that of Giri (1977), p. 104, and
Muirhead (1982), p. 113.


Theorem 7.7.2. If $A$ has the distribution $W(\Sigma, n)$ and $\Sigma$ has the a priori
distribution $W^{-1}(\Psi, m)$, then the conditional distribution of $\Sigma$ is
$W^{-1}(A + \Psi, n + m)$.


Proof. The joint density of $A$ and $\Sigma$ is

(2)    $\frac{|\Psi|^{\frac{1}{2}m} \, |\Sigma|^{-\frac{1}{2}(n+m+p+1)} \, |A|^{\frac{1}{2}(n-p-1)} \, e^{-\frac{1}{2}\operatorname{tr}(A + \Psi)\Sigma^{-1}}}{2^{\frac{1}{2}(n+m)p} \, \Gamma_p(\tfrac{1}{2}n) \, \Gamma_p(\tfrac{1}{2}m)}$

for $A$ and $\Sigma$ positive definite. The marginal density of $A$ is the integral of (2)
over the set of $\Sigma$ positive definite. Since the integral of (1) with respect to $B$
is 1 identically in $\Psi$, the integral of (2) with respect to $\Sigma$ is

(3)    $\frac{\Gamma_p[\tfrac{1}{2}(n + m)] \, |\Psi|^{\frac{1}{2}m} \, |A|^{\frac{1}{2}(n-p-1)} \, |A + \Psi|^{-\frac{1}{2}(n+m)}}{\Gamma_p(\tfrac{1}{2}n) \, \Gamma_p(\tfrac{1}{2}m)}$

for $A$ positive definite. The conditional density of $\Sigma$ given $A$ is the ratio of
(2) to (3), namely,

(4)    $\frac{|A + \Psi|^{\frac{1}{2}(n+m)} \, |\Sigma|^{-\frac{1}{2}(n+m+p+1)} \, e^{-\frac{1}{2}\operatorname{tr}(A + \Psi)\Sigma^{-1}}}{2^{\frac{1}{2}(n+m)p} \, \Gamma_p[\tfrac{1}{2}(n + m)]},$

which is $w^{-1}(\Sigma \mid A + \Psi, n + m)$. ∎


Corollary 7.7.1. If $nS$ has the distribution $W(\Sigma, n)$ and $\Sigma$ has the a priori
distribution $W^{-1}(\Psi, m)$, then the conditional distribution of $\Sigma$ given $S$ is
$W^{-1}(nS + \Psi, n + m)$.


Corollary 7.7.2. If $nS$ has the distribution $W(\Sigma, n)$, $\Sigma$ has the a priori
distribution $W^{-1}(\Psi, m)$, and the loss function is $\operatorname{tr}(D - \Sigma)G(D - \Sigma)H$, where
$G$ and $H$ are positive definite, then the Bayes estimator for $\Sigma$ is

(5)    $\frac{1}{n + m - p - 1}(nS + \Psi).$

Proof. It follows from Section 3.4.2 that the Bayes estimator for $\Sigma$ is
$\mathcal{E}(\Sigma \mid S)$. From Theorem 7.7.2 we see that $\Sigma^{-1}$ has the a posteriori distribution
$W[(nS + \Psi)^{-1}, n + m]$. The theorem results from the following lemma. ∎


Lemma 7.7.1. If $A$ has the distribution $W(\Sigma, n)$, then

(6)    $\mathcal{E}A^{-1} = \frac{1}{n - p - 1}\Sigma^{-1}.$

Proof. If $C$ is a nonsingular matrix such that $\Sigma = CC'$, then $A$ has the
distribution of $CBC'$, where $B$ has the distribution $W(I, n)$, and $\mathcal{E}A^{-1} =
(C')^{-1}(\mathcal{E}B^{-1})C^{-1}$. By symmetry the diagonal elements of $\mathcal{E}B^{-1}$ are the
same and the off-diagonal elements are the same; that is, $\mathcal{E}B^{-1} = k_1 I +
k_2 \varepsilon\varepsilon'$. For every orthogonal matrix $Q$, $QBQ'$ has the distribution $W(I, n)$ and
hence $\mathcal{E}(QBQ')^{-1} = Q\,\mathcal{E}B^{-1}Q' = \mathcal{E}B^{-1}$. Thus $k_2 = 0$. A diagonal element of
$B^{-1}$ has the distribution of $(\chi^2_{n-p+1})^{-1}$. (See, e.g., the proof of Theorem
5.2.2.) Since $\mathcal{E}(\chi^2_{n-p+1})^{-1} = (n - p - 1)^{-1}$, $\mathcal{E}B^{-1} = (n - p - 1)^{-1}I$. Then (6)
follows. ∎


We note that $(n - p - 1)A^{-1} = [(n - p - 1)/n]S^{-1}$ is an unbiased
estimator of the precision $\Sigma^{-1}$.

If $\mu$ is known, the unbiased estimator of $\Sigma$ is $(1/N)\sum_{\alpha=1}^{N}(x_\alpha - \mu)(x_\alpha - \mu)'$.
The above can be applied with $n$ replaced by $N$. Note that if $n$ (or $N$) is
large, (5) is approximately $S$.
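The Bayes estimator of Corollary 7.7.2, $(nS + \Psi)/(n + m - p - 1)$, is a one-line computation. The sketch below uses an assumed function name and toy inputs; it also illustrates the remark that the estimator is approximately $S$ when $n$ is large.

```python
def bayes_covariance_estimate(s, psi, n, m):
    """Posterior-mean estimator (nS + Psi) / (n + m - p - 1) under an inverted Wishart prior."""
    p = len(s)
    denom = n + m - p - 1
    if denom <= 0:
        raise ValueError("need n + m > p + 1 for the posterior mean to exist")
    return [[(n * s[i][j] + psi[i][j]) / denom for j in range(p)] for i in range(p)]

s = [[2.0, 0.5], [0.5, 1.0]]       # sample covariance matrix (toy values)
psi = [[1.0, 0.0], [0.0, 1.0]]     # prior parameter Psi (toy values)
small_n = bayes_covariance_estimate(s, psi, n=10, m=5)
large_n = bayes_covariance_estimate(s, psi, n=10**6, m=5)
```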


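Theorem 7.7.3. Let $x_1, \ldots, x_N$ be observations from $N(\mu, \Sigma)$. Suppose $\mu$
and $\Sigma$ have the a priori density $n(\mu \mid \nu, (1/K)\Sigma) \times w^{-1}(\Sigma \mid \Psi, m)$. Then the a
posteriori density of $\mu$ and $\Sigma$ given $\bar{x} = (1/N)\sum_{\alpha=1}^{N} x_\alpha$ and $S = (1/n)\sum_{\alpha=1}^{N}(x_\alpha
- \bar{x})(x_\alpha - \bar{x})'$ is

(7)    $n\!\left(\mu \,\Big|\, \frac{1}{N + K}(N\bar{x} + K\nu), \frac{1}{N + K}\Sigma\right) \cdot w^{-1}\!\left(\Sigma \,\Big|\, \Psi + nS + \frac{NK}{N + K}(\bar{x} - \nu)(\bar{x} - \nu)', N + m\right).$


Proof. Since $\bar{x}$ and $nS = A$ are a sufficient set of statistics, we can consider
the joint density of $\bar{x}$, $A$, $\mu$, and $\Sigma$, which is

(8)    $\frac{K^{\frac{1}{2}p} N^{\frac{1}{2}p} \, |\Psi|^{\frac{1}{2}m} \, |\Sigma|^{-\frac{1}{2}(N+m+p+2)} \, |A|^{\frac{1}{2}(N-p-2)}}{2^{\frac{1}{2}(N+m+1)p} \, \pi^{p} \, \Gamma_p[\tfrac{1}{2}(N - 1)] \, \Gamma_p(\tfrac{1}{2}m)}$

       $\cdot \exp\left\{-\tfrac{1}{2}\left[N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) + \operatorname{tr} A\Sigma^{-1} + K(\mu - \nu)'\Sigma^{-1}(\mu - \nu) + \operatorname{tr} \Psi\Sigma^{-1}\right]\right\}.$

The marginal density of $\bar{x}$ and $A$ is the integral of (8) with respect to $\mu$ and
$\Sigma$. The exponential in (8) is $-\tfrac{1}{2}$ times

(9)    $(N + K)\mu'\Sigma^{-1}\mu - 2(N\bar{x} + K\nu)'\Sigma^{-1}\mu + N\bar{x}'\Sigma^{-1}\bar{x} + K\nu'\Sigma^{-1}\nu + \operatorname{tr}(A + \Psi)\Sigma^{-1}$

       $= (N + K)\left[\mu - \frac{1}{N + K}(N\bar{x} + K\nu)\right]'\Sigma^{-1}\left[\mu - \frac{1}{N + K}(N\bar{x} + K\nu)\right]$

       $\quad + \frac{NK}{N + K}(\bar{x} - \nu)'\Sigma^{-1}(\bar{x} - \nu) + \operatorname{tr}(A + \Psi)\Sigma^{-1}.$

The integral of (8) with respect to $\mu$ is

(10)    $\frac{K^{\frac{1}{2}p} N^{\frac{1}{2}p} \, |\Psi|^{\frac{1}{2}m} \, |\Sigma|^{-\frac{1}{2}(N+m+p+1)} \, |A|^{\frac{1}{2}(N-p-2)}}{(N + K)^{\frac{1}{2}p} \, 2^{\frac{1}{2}(N+m)p} \, \pi^{\frac{1}{2}p} \, \Gamma_p[\tfrac{1}{2}(N - 1)] \, \Gamma_p(\tfrac{1}{2}m)}$

        $\cdot \exp\left\{-\tfrac{1}{2}\left[\operatorname{tr} A\Sigma^{-1} + \frac{NK}{N + K}(\bar{x} - \nu)'\Sigma^{-1}(\bar{x} - \nu) + \operatorname{tr} \Psi\Sigma^{-1}\right]\right\}.$

In turn, the integral of (10) with respect to $\Sigma$ is

(11)    $\frac{K^{\frac{1}{2}p} N^{\frac{1}{2}p} \, \Gamma_p[\tfrac{1}{2}(N + m)] \, |\Psi|^{\frac{1}{2}m} \, |A|^{\frac{1}{2}(N-p-2)}}{\pi^{\frac{1}{2}p} \, \Gamma_p[\tfrac{1}{2}(N - 1)] \, \Gamma_p(\tfrac{1}{2}m) \, (N + K)^{\frac{1}{2}p}} \cdot \left|\Psi + A + \frac{NK}{N + K}(\bar{x} - \nu)(\bar{x} - \nu)'\right|^{-\frac{1}{2}(N+m)}.$

The conditional density of $\mu$ and $\Sigma$ given $\bar{x}$ and $A$ is the ratio of (8) to (11),
namely,

(12)    $\frac{(N + K)^{\frac{1}{2}p} \, |\Sigma|^{-\frac{1}{2}(N+m+p+2)} \, \left|\Psi + A + \frac{NK}{N + K}(\bar{x} - \nu)(\bar{x} - \nu)'\right|^{\frac{1}{2}(N+m)}}{2^{\frac{1}{2}(N+m+1)p} \, \pi^{\frac{1}{2}p} \, \Gamma_p[\tfrac{1}{2}(N + m)]}$

        $\cdot \exp\Big\{-\tfrac{1}{2}\Big[(N + K)\left[\mu - \tfrac{1}{N + K}(N\bar{x} + K\nu)\right]'\Sigma^{-1}\left[\mu - \tfrac{1}{N + K}(N\bar{x} + K\nu)\right]$

        $\qquad + \operatorname{tr}\left[\Psi + A + \tfrac{NK}{N + K}(\bar{x} - \nu)(\bar{x} - \nu)'\right]\Sigma^{-1}\Big]\Big\}.$

Then (12) can be written as (7). ∎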


Corollary 7.7.3. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$, if $\mu$ and $\Sigma$
have the a priori density $n[\mu \mid \nu, (1/K)\Sigma] \times w^{-1}(\Sigma \mid \Psi, m)$, and if the loss
function is $(d - \mu)'J(d - \mu) + \operatorname{tr}(D - \Sigma)G(D - \Sigma)H$, then the Bayes estimators
of $\mu$ and $\Sigma$ are

(13)    $\frac{1}{N + K}(N\bar{x} + K\nu)$

and

(14)    $\frac{1}{N + m - p - 1}\left[nS + \Psi + \frac{NK}{N + K}(\bar{x} - \nu)(\bar{x} - \nu)'\right],$

respectively.


The estimator of $\mu$ is a weighted average of the sample mean $\bar{x}$ and the
a priori mean $\nu$. If $N$ is large, the a priori mean has relatively little weight.
The estimator of $\Sigma$ is a weighted average of the sample covariance $S$, $\Psi$,
and a term deriving from the difference between the sample mean and the a
priori mean. If $N$ is large, the estimator is close to the sample covariance
matrix.
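The two Bayes estimators of this corollary, the weighted mean $(N\bar{x} + K\nu)/(N + K)$ and the covariance estimator with the extra rank-one term, can be sketched as follows. The function name, argument names, and toy inputs are assumptions made for illustration.

```python
def bayes_mean_and_covariance(xbar, s, nu, psi, big_n, k, m):
    """Posterior means of mu and Sigma under the normal / inverted Wishart prior.

    mean = (N xbar + K nu) / (N + K)
    cov  = [nS + Psi + (NK/(N+K)) (xbar - nu)(xbar - nu)'] / (N + m - p - 1), n = N - 1
    """
    p = len(xbar)
    n = big_n - 1
    mean = [(big_n * xbar[i] + k * nu[i]) / (big_n + k) for i in range(p)]
    diff = [xbar[i] - nu[i] for i in range(p)]
    weight = big_n * k / (big_n + k)
    denom = big_n + m - p - 1
    cov = [[(n * s[i][j] + psi[i][j] + weight * diff[i] * diff[j]) / denom
            for j in range(p)] for i in range(p)]
    return mean, cov

identity = [[1.0, 0.0], [0.0, 1.0]]
mean, cov = bayes_mean_and_covariance(
    xbar=[1.0, 2.0], s=identity, nu=[0.0, 0.0], psi=identity, big_n=9, k=1.0, m=4)
```

With these inputs the mean estimate is $(0.9, 1.8)$: the prior mean pulls the sample mean toward the origin with weight $K/(N + K) = 0.1$.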


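Theorem 7.7.4. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$ and if $\mu$ and
$\Sigma$ have the a priori density $n[\mu \mid \nu, (1/K)\Sigma] \times w^{-1}(\Sigma \mid \Psi, m)$, then the marginal
a posteriori density of $\mu$ given $\bar{x}$ and $S$ is

(15)    $\frac{(N + K)^{\frac{1}{2}p} \, \Gamma[\tfrac{1}{2}(N + m + 1)] \, |B|^{-\frac{1}{2}}}{\pi^{\frac{1}{2}p} \, \Gamma[\tfrac{1}{2}(N + m + 1 - p)] \, [1 + (N + K)(\mu - \mu^*)'B^{-1}(\mu - \mu^*)]^{\frac{1}{2}(N+m+1)}},$

where $\mu^*$ is (13) and $B$ is $N + m - p - 1$ times (14).

Proof. The exponent in (12) is $-\tfrac{1}{2}$ times

(16)    $\operatorname{tr}[B + (N + K)(\mu - \mu^*)(\mu - \mu^*)']\Sigma^{-1}.$

Then the integral of (12) with respect to $\Sigma$ is

(17)    $\frac{(N + K)^{\frac{1}{2}p} \, \Gamma_p[\tfrac{1}{2}(N + m + 1)] \, |B|^{\frac{1}{2}(N+m)}}{\pi^{\frac{1}{2}p} \, \Gamma_p[\tfrac{1}{2}(N + m)] \, |B + (N + K)(\mu - \mu^*)(\mu - \mu^*)'|^{\frac{1}{2}(N+m+1)}}.$

Since $|B + xx'| = |B|(1 + x'B^{-1}x)$ (Corollary A.3.1), (15) follows. ∎


The density (15) is the multivariate $t$-distribution with $N + m + 1 - p$
degrees of freedom. See Section 2.7.5, Examples.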


7.8. IMPROVED ESTIMATION OF THE COVARIANCE MATRIX 


Just as the sample mean $\bar{x}$ can be improved on as an estimator of the
population mean $\mu$ when the loss function is quadratic, so can the sample
covariance $S$ be improved on as an estimator of the population covariance $\Sigma$
for certain loss functions. The loss function for estimation of the location
parameter $\mu$ was invariant with respect to translation ($x \to x + a$, $\mu \to \mu + a$),
and the risk of the sample mean (which is the unique unbiased function of
the sufficient statistic when $\Sigma$ is known) does not depend on the parameter
value. The natural group of transformations of covariance matrices is multiplication
on the left by a nonsingular matrix and on the right by its transpose
($x \to Cx$, $S \to CSC'$, $\Sigma \to C\Sigma C'$). We consider two loss functions which are
invariant with respect to such transformations.

One loss function is quadratic:

(1)    $L_q(\Sigma, G) = \operatorname{tr}(G - \Sigma)\Sigma^{-1}(G - \Sigma)\Sigma^{-1} = \operatorname{tr}(G\Sigma^{-1} - I)^2,$

where $G$ is a positive definite matrix. The other is based on the form of the
likelihood function:

(2)    $L_l(\Sigma, G) = \operatorname{tr} G\Sigma^{-1} - \log|G\Sigma^{-1}| - p.$

(See Lemma 3.2.2 and alternative proofs in Problems 3.4, 3.8, and 3.12.) Each
of these is 0 when $G = \Sigma$ and is positive when $G \ne \Sigma$. The second loss
function approaches $\infty$ as $G$ approaches a singular matrix or when one or
more elements (or one or more characteristic roots) of $G$ approaches $\infty$. (See
proof of Lemma 3.2.2.) Each is invariant with respect to transformations
$G^* = CGC'$, $\Sigma^* = C\Sigma C'$. We can see some properties of the loss functions
from $L_q(I, D) = \sum_{i=1}^{p}(d_{ii} - 1)^2$ and $L_l(I, D) = \sum_{i=1}^{p}(d_{ii} - \log d_{ii} - 1)$, where
$D$ is diagonal. (By Theorem A.2.2 of the Appendix for arbitrary positive
definite $\Sigma$ and symmetric $G$, there exists a nonsingular $C$ such that $C\Sigma C' = I$
and $CGC' = D$.) If we let $g = (g_{11}, \ldots, g_{pp}, g_{12}, \ldots, g_{p-1,p})'$, $s =
(s_{11}, \ldots, s_{pp}, s_{12}, \ldots, s_{p-1,p})'$, $\sigma = (\sigma_{11}, \ldots, \sigma_{pp}, \sigma_{12}, \ldots, \sigma_{p-1,p})'$, and $\Phi =
\mathcal{E}(s - \sigma)(s - \sigma)'$, then $L_q(\Sigma, G)$ is a constant multiple of $(g - \sigma)'\Phi^{-1}(g - \sigma)$.
(See Problem 7.33.)

The maximum likelihood estimator $\hat{\Sigma}$ and the unbiased estimator $S$ are of
the form $aA$, where $A$ has the distribution $W(\Sigma, n)$ and $n = N - 1$.
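The behavior of the two loss functions in the reduced diagonal case, $L_q(I, D) = \sum (d_{ii} - 1)^2$ and $L_l(I, D) = \sum (d_{ii} - \log d_{ii} - 1)$, can be tabulated directly. This is an illustrative sketch with assumed function names; it shows that both losses vanish at $D = I$ and that only the likelihood loss grows without bound as a diagonal element approaches 0.

```python
import math

def quadratic_loss_diag(d):
    """L_q(I, D) = sum_i (d_i - 1)^2 for a diagonal estimate D = diag(d)."""
    return sum((di - 1.0) ** 2 for di in d)

def likelihood_loss_diag(d):
    """L_l(I, D) = sum_i (d_i - log d_i - 1); requires every d_i > 0."""
    return sum(di - math.log(di) - 1.0 for di in d)

at_truth = (quadratic_loss_diag([1.0, 1.0, 1.0]), likelihood_loss_diag([1.0, 1.0, 1.0]))
near_singular = (quadratic_loss_diag([1e-6]), likelihood_loss_diag([1e-6]))
```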


Theorem 7.8.1. The quadratic risk of $aA$ is minimized at $a = 1/(n + p + 1)$,
and its value is $p(p + 1)/(n + p + 1)$. The likelihood risk of $aA$ is minimized at
$a = 1/n$ (i.e., $aA = S$), and its value is $p \log n - \sum_{i=1}^{p} \mathcal{E} \log \chi^2_{n+1-i}$.


Proof. By the invariance of the loss function

(3)    $\mathcal{E}_\Sigma L_q(\Sigma, aA) = \mathcal{E}_I L_q(I, aA^*)$
       $= \mathcal{E}_I \operatorname{tr}(aA^* - I)^2$
       $= \mathcal{E}_I \left( a^2 \sum_{i,j=1}^{p} a_{ij}^{*2} - 2a \sum_{i=1}^{p} a_{ii}^* + p \right)$
       $= a^2[(2n + n^2)p + np(p - 1)] - 2anp + p$
       $= p[n(n + p + 1)a^2 - 2na + 1],$

which has its minimum at $a = 1/(n + p + 1)$. Similarly

(4)    $\mathcal{E}_\Sigma L_l(\Sigma, aA) = \mathcal{E}_I L_l(I, aA^*)$
       $= \mathcal{E}_I \{a \operatorname{tr} A^* - \log|A^*| - p \log a - p\}$
       $= p[na - \log a - 1] - \mathcal{E}_I \log|A^*|,$

which is minimized at $a = 1/n$. ∎
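The quadratic risk $p[n(n + p + 1)a^2 - 2na + 1]$ obtained in the proof is a parabola in $a$, so its minimizer and minimum are easy to confirm in exact arithmetic. The function name and the values $n = 10$, $p = 3$ below are assumptions for illustration.

```python
from fractions import Fraction

def quadratic_risk(a, n, p):
    """Risk of aA under quadratic loss: p [n(n + p + 1) a^2 - 2 n a + 1]."""
    return p * (n * (n + p + 1) * a * a - 2 * n * a + 1)

n, p = 10, 3
best_a = Fraction(1, n + p + 1)
risk_best = quadratic_risk(best_a, n, p)               # p(p + 1)/(n + p + 1) = 6/7
risk_unbiased = quadratic_risk(Fraction(1, n), n, p)   # risk of S = A/n, here 6/5
```

The unbiased estimator $S$ has risk $p(p + 1)/n$, which exceeds the minimum $p(p + 1)/(n + p + 1)$.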


Although the minimum risk of the estimator of the form $aA$ is constant for
its loss function, the estimator is not minimax. We shall now consider
estimators $G(A)$ such that

(5)    $G(HAH') = HG(A)H'$

for lower triangular matrices $H$. The two loss functions are invariant with
respect to transformations $G^* = HGH'$, $\Sigma^* = H\Sigma H'$.

Let $A = I$ and $H$ be the diagonal matrix $D_i$ with $-1$ as the $i$th diagonal
element and 1 as each other diagonal element. Then $HAH' = I$, and the
$i, j$th component of (5) is

(6)    $g_{ij}(I) = -g_{ij}(I), \qquad j \ne i.$

Hence, $g_{ij}(I) = 0$, $i \ne j$, and $G(I)$ is diagonal, say $D$. Since $A = TT'$ for $T$
lower triangular, we have

(7)    $G(A) = G(TIT') = TG(I)T' = TDT',$

where $D$ is a diagonal matrix not depending on $A$. We note in passing that if
(5) holds for all nonsingular $H$, then $D = aI$ for some $a$. ($H$ can be taken as
a permutation matrix.)

If $\Sigma = KK'$, where $K$ is lower triangular, then

(8)    $\mathcal{E}_\Sigma L[\Sigma, G(A)] = \int L[\Sigma, G(A)] \, C(p, n) \, |\Sigma|^{-\frac{1}{2}n} |A|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}A} \, dA$
       $= \int L[KK', G(A)] \, C(p, n) \, |KK'|^{-\frac{1}{2}n} |A|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr}(K')^{-1}K^{-1}A} \, dA$
       $= \int L[KK', G(KA^*K')] \, C(p, n) \, |A^*|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr}A^*} \, dA^*$
       $= \mathcal{E}_I L[KK', KG(A^*)K']$
       $= \mathcal{E}_I L[I, G(A^*)]$

by invariance of the loss function. The risk does not depend on $\Sigma$.
For the quadratic loss function we calculate

(9)    $\mathcal{E}_I L_q[I, G(A)] = \mathcal{E}_I L_q[I, TDT']$
       $= \mathcal{E}_I \operatorname{tr}(TDT' - I)^2$
       $= \mathcal{E}_I \operatorname{tr}(TDT'TDT' - 2TDT' + I)$
       $= \mathcal{E}_I \sum_{i,j,k,l=1}^{p} t_{ij} d_j t_{kj} t_{kl} d_l t_{il} - 2\,\mathcal{E}_I \sum_{i,j=1}^{p} t_{ij}^2 d_j + p.$

The expectations can be evaluated by using the fact that the (nonzero)
elements of $T$ are independent, $t_{ii}^2$ has the $\chi^2$-distribution with $n + 1 - i$
degrees of freedom, and $t_{ij}$, $i > j$, has the distribution $N(0, 1)$. Then

(10)    $\mathcal{E}_I L_q[I, G(A)] = d'Fd - 2f'd + p,$

where $F = (f_{ij})$, $f = (f_i)$,

(11)    $f_{ii} = (n + p - 2i + 1)(n + p - 2i + 3),$
        $f_{ij} = n + p - 2j + 1, \qquad i < j,$
        $f_i = n + p - 2i + 1,$

and $d = (d_1, \ldots, d_p)'$. Since $d'Fd = \mathcal{E} \operatorname{tr}(TDT')^2 > 0$, $F$ is positive definite
and (10) has a unique minimum. It is attained at $d = F^{-1}f$, and the minimum
is $p - f'F^{-1}f$.


Theorem 7.8.2. With respect to the quadratic loss function the best estimator
invariant with respect to linear transformations $\Sigma \to H\Sigma H'$, $A \to HAH'$, where
$H$ is lower triangular, is $G(A) = TDT'$, where $D$ is the diagonal matrix whose
diagonal elements compose $d = F^{-1}f$, $F$ and $f$ are defined by (11), and $A = TT'$
with $T$ lower triangular.

Since $d = F^{-1}f$ is not proportional to $\varepsilon = (1, \ldots, 1)'$, that is, $F\varepsilon$ is not
proportional to $f$ (see Problem 7.28), this estimator has a smaller (quadratic)
loss than any estimator of the form $aA$ (which is the only type of estimator
invariant under the full linear group). Kiefer (1957) showed that if an
estimator is minimax in the class of estimators invariant with respect to a
group of transformations satisfying certain conditions,† then it is minimax
with respect to all estimators. In this problem the group of triangular linear
transformations satisfies the conditions, while the group of all linear
transformations does not.

The definition of this estimator depends on the coordinate system and on
the numbering of the coordinates. These properties are intuitively unappealing.


Theorem 7.8.3. The estimator $G(A)$ defined in Theorem 7.8.2 is minimax
with respect to the quadratic loss function.


In the case of $p = 2$

(12)    $d_1 = \frac{(n + 1)^2 - (n - 1)}{(n + 1)^2(n + 3) - (n - 1)}, \qquad d_2 = \frac{(n + 1)(n + 2)}{(n + 1)^2(n + 3) - (n - 1)}.$

The risk is

(13)    $\frac{2(3n^2 + 5n + 4)}{n^3 + 5n^2 + 6n + 4}.$

The difference between the risks of the best estimator $aA$ and the best
estimator $TDT'$ is

(14)    $\frac{6}{n + 3} - \frac{6n^2 + 10n + 8}{n^3 + 5n^2 + 6n + 4} = \frac{2n(n - 1)}{(n + 3)(n^3 + 5n^2 + 6n + 4)}.$

The difference is $\tfrac{1}{55}$ for $n = 2$ (relative to $\tfrac{6}{5}$), and $\tfrac{1}{47}$ for $n = 3$ (relative to 1);
it is of the order $2/n^2$; the improvement due to using the estimator $TDT'$ is
not great, at least for $p = 2$.
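The weights $d = F^{-1}f$ can be computed for any $p$ by solving the linear system exactly. The sketch below is illustrative: the function name is assumed, $F$ is filled in symmetrically with $f_{ij} = n + p - 2\max(i, j) + 1$ for $i \ne j$ (which agrees with (11) for $i < j$), and the system is solved by Gauss-Jordan elimination in rational arithmetic.

```python
from fractions import Fraction

def quadratic_weights(n, p):
    """Solve F d = f with F and f as in (11); returns (d, minimum risk p - f'd)."""
    f = [Fraction(n + p - 2 * i + 1) for i in range(1, p + 1)]
    big_f = [[Fraction(0)] * p for _ in range(p)]
    for i in range(1, p + 1):
        for j in range(1, p + 1):
            if i == j:
                big_f[i - 1][j - 1] = Fraction((n + p - 2 * i + 1) * (n + p - 2 * i + 3))
            else:
                big_f[i - 1][j - 1] = Fraction(n + p - 2 * max(i, j) + 1)
    # Gauss-Jordan elimination on the augmented matrix [F | f] in exact arithmetic.
    aug = [row[:] + [rhs] for row, rhs in zip(big_f, f)]
    for k in range(p):
        piv = next(r for r in range(k, p) if aug[r][k] != 0)
        aug[k], aug[piv] = aug[piv], aug[k]
        aug[k] = [v / aug[k][k] for v in aug[k]]
        for r in range(p):
            if r != k:
                aug[r] = [vr - aug[r][k] * vk for vr, vk in zip(aug[r], aug[k])]
    d = [row[p] for row in aug]
    risk = p - sum(fi * di for fi, di in zip(f, d))
    return d, risk

d2, risk2 = quadratic_weights(2, 2)
gain_n2 = Fraction(6, 5) - risk2                        # best aA risk minus best TDT' risk
gain_n3 = Fraction(6, 6) - quadratic_weights(3, 2)[1]
```

For $p = 2$ and $n = 2$ this gives $d = (2/11, 3/11)$ and a risk reduction of $1/55$, in agreement with (12) and (14).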

† The essential condition is that the group is solvable. See Kiefer (1966) and Kudo (1955).

For the likelihood loss function we calculate

(15)    $\mathcal{E}_I L_l[I, G(A)] = \mathcal{E}_I L_l[I, TDT']$
        $= \mathcal{E}_I [\operatorname{tr} TDT' - \log|TDT'| - p]$
        $= \mathcal{E}_I \left[ \sum_{i,j=1}^{p} t_{ij}^2 d_j - \log \prod_{i=1}^{p} t_{ii}^2 - \sum_{i=1}^{p} \log d_i - p \right]$
        $= \sum_{j=1}^{p} (n + p - 2j + 1) d_j - \sum_{j=1}^{p} \log d_j - \sum_{j=1}^{p} \mathcal{E} \log \chi^2_{n+1-j} - p.$

The minimum of (15) occurs at $d_j = 1/(n + p - 2j + 1)$, $j = 1, \ldots, p$.


Theorem 7.8.4. With respect to the likelihood loss function, the best estimator
invariant with respect to linear transformations $\Sigma \to H\Sigma H'$, $A \to HAH'$,
where $H$ is lower triangular, is $G(A) = TDT'$, where the $j$th diagonal element of
the diagonal matrix $D$ is $1/(n + p - 2j + 1)$, $j = 1, \ldots, p$, and $A = TT'$, with $T$
lower triangular. The minimum risk is

(16)    $\mathcal{E}_\Sigma L_l[\Sigma, G(A)] = \sum_{j=1}^{p} \log(n + p - 2j + 1) - \sum_{j=1}^{p} \mathcal{E} \log \chi^2_{n+1-j}.$


Theorem 7.8.5. The estimator $G(A)$ defined in Theorem 7.8.4 is minimax
with respect to the likelihood loss function.


James and Stein (1961) gave this estimator. Note that the reciprocals of
the weights $1/(n + p - 1), 1/(n + p - 3), \ldots, 1/(n - p + 1)$ are symmetrically
distributed about the reciprocal of $1/n$.


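If $p = 2$,

(17)    $G(A) = \frac{1}{n + 1} A + \begin{pmatrix} 0 & 0 \\ 0 & \dfrac{2}{n^2 - 1} \dfrac{|A|}{a_{11}} \end{pmatrix},$

(18)    $\mathcal{E}\,G(A) = \frac{n}{n + 1} \Sigma + \frac{2}{n + 1} \begin{pmatrix} 0 & 0 \\ 0 & \dfrac{|\Sigma|}{\sigma_{11}} \end{pmatrix}.$

The difference between the risks of the best estimator $aA$ and the best
estimator $TDT'$ is

(19)    $p \log n - \sum_{j=1}^{p} \log(n + p - 2j + 1) = -\sum_{j=1}^{p} \log\left(1 + \frac{p - 2j + 1}{n}\right).$

If $p = 2$, the improvement is

(20)    $-\log\left(1 + \frac{1}{n}\right) - \log\left(1 - \frac{1}{n}\right) = -\log\left(1 - \frac{1}{n^2}\right),$

which is 0.288 for $n = 2$, 0.118 for $n = 3$, 0.065 for $n = 4$, etc. The risk (19) is
$O(1/n^2)$ for any $p$. (See Problem 7.31.)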
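The improvement $p \log n - \sum_{j} \log(n + p - 2j + 1)$ is simple to tabulate, and doing so reproduces the values quoted for $p = 2$. This is a minimal sketch; the function names are illustrative.

```python
import math

def likelihood_weights(n, p):
    """Diagonal elements d_j = 1/(n + p - 2j + 1), j = 1..p, of the best triangular-invariant D."""
    return [1.0 / (n + p - 2 * j + 1) for j in range(1, p + 1)]

def likelihood_improvement(n, p):
    """Risk of the best aA minus risk of the best TDT': p log n - sum_j log(n + p - 2j + 1)."""
    return p * math.log(n) - sum(math.log(n + p - 2 * j + 1) for j in range(1, p + 1))

table = [round(likelihood_improvement(n, 2), 3) for n in (2, 3, 4)]
```

For $p = 2$ the improvement equals $-\log(1 - 1/n^2)$, which is roughly $1/n^2$ for large $n$.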

An obvious disadvantage of these estimators is that they depend on the
coordinate system. Let $P_i$ be the $i$th permutation matrix, $i = 1, \ldots, p!$, and let
$P_i A P_i' = T_i T_i'$, where $T_i$ is lower triangular and $t_{jj} > 0$, $j = 1, \ldots, p$. Then a
randomized estimator that does not depend on the numbering of coordinates
is to let the estimator be $P_i' T_i D T_i' P_i$ with probability $1/p!$; this estimator has
the same risk as the estimator for the original numbering of coordinates.
Since the loss functions are convex, $(1/p!)\sum_i P_i' T_i D T_i' P_i$ will have at least as
good a risk function; in this case the risk will depend on $\Sigma$.

Haff (1980) has shown that $G(A) = [1/(n + p + 1)](A + \gamma u C)$,
where $\gamma$ is constant, $0 \le \gamma \le 2(p - 1)/(n - p + 3)$, $u = 1/\operatorname{tr}(A^{-1}C)$, and $C$ is
an arbitrary positive definite matrix, has a smaller quadratic risk than
$[1/(n + p + 1)]A$. The estimator $G(A) = (1/n)[A + u\,t(u)C]$, where $t(u)$ is an
absolutely continuous, nonincreasing function, $0 \le t(u) \le 2(p - 1)/n$, has a
smaller likelihood risk than $S$.


7.9. ELLIPTICALLY CONTOURED DISTRIBUTIONS


7.9.1. Observations Elliptically Contoured


Consider $x_1, \ldots, x_N$ observations on a random vector $X$ with density

(1)    $|\Lambda|^{-\frac{1}{2}} g[(x - \nu)'\Lambda^{-1}(x - \nu)].$

Let $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, $n = N - 1$, $S = (1/n)A$. Then $S \xrightarrow{p} \Sigma$ as
$N \to \infty$. The limiting normal distribution of $\sqrt{N} \operatorname{vec}(S - \Sigma)$ was given in
Theorem 3.6.2.

The lower triangular matrix $T$, satisfying $A = TT'$, was used in Section 7.2
in deriving the distribution of $A$ and hence of $S$. Define the lower triangular
matrix $\bar{T}$ by $S = \bar{T}\bar{T}'$, $\bar{t}_{ii} \ge 0$, $i = 1, \ldots, p$. Then $\bar{T} = (1/\sqrt{n})T$. If $\Sigma = I$, then
$S \xrightarrow{p} I$ and $\bar{T} \xrightarrow{p} I$, $\sqrt{N}(S - I)$ and $\sqrt{N}(\bar{T} - I)$ have limiting normal distributions,
and

(2)    $\sqrt{N}(S - I) = \sqrt{N}(\bar{T} - I) + \sqrt{N}(\bar{T} - I)' + o_p(1).$

That is, $\sqrt{N}(s_{ii} - 1) = 2\sqrt{N}(\bar{t}_{ii} - 1) + o_p(1)$, and $\sqrt{N} s_{ij} = \sqrt{N}\,\bar{t}_{ij} + o_p(1)$, $i > j$.
When $\Sigma = I$, the set $\sqrt{N}(s_{11} - 1), \ldots, \sqrt{N}(s_{pp} - 1)$ and the set $\sqrt{N} s_{ij}$, $i > j$,
are asymptotically independent; $\sqrt{N} s_{12}, \ldots, \sqrt{N} s_{p-1,p}$ are mutually asymptotically
independent, each with variance $1 + \kappa$; the limiting variance of
$\sqrt{N}(s_{ii} - 1)$ is $3\kappa + 2$; and the limiting covariance of $\sqrt{N}(s_{ii} - 1)$ and $\sqrt{N}(s_{jj}
- 1)$, $i \ne j$, is $\kappa$.


Theorem 7.9.1. If $\Sigma = I_p$, the limiting distribution of $\sqrt{N}(\bar{T} - I_p)$ is normal
with mean 0. The variance of a diagonal element is $(3\kappa + 2)/4$; the covariance of
two diagonal elements is $\kappa/4$; the variance of an off-diagonal element is $\kappa + 1$;
the off-diagonal elements are uncorrelated and are uncorrelated with the diagonal
elements.


Let $X = \nu + CY$, where $Y$ has the density $g(y'y)$, $\Lambda = CC'$, and $\Sigma = \mathcal{E}(X
- \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Lambda = \Gamma\Gamma'$, and $C$ and $\Gamma$ are lower triangular. Let $S$
be the sample covariance of a sample of $N$ on $X$. Let $S = \bar{T}\bar{T}'$. Then $S \xrightarrow{p} \Sigma$,
$\bar{T} \xrightarrow{p} \Gamma$, and

(3)    $\sqrt{N}(S - \Sigma) = \sqrt{N}(\bar{T} - \Gamma)\Gamma' + \Gamma\sqrt{N}(\bar{T} - \Gamma)' + o_p(1).$

The limiting distribution of $\sqrt{N}(\bar{T} - \Gamma)$ is normal, and the covariance can be
calculated from (3) and the covariances of the elements of $\sqrt{N}(S - \Sigma)$. Since
the primary interest in $\bar{T}$ is to find the distribution of $S$, we do not pursue
this further here.


7.9.2. Elliptically Contoured Matrix Distributions

Let $X$ ($N \times p$) have the density

(4)    $|C|^{-N} g[C^{-1}(X - \varepsilon_N \nu')'(X - \varepsilon_N \nu')(C')^{-1}]$

based on the left spherical density $g(Y'Y)$.


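Theorem 7.9.2. Define $T = (t_{ij})$ by $Y'Y = TT'$, $t_{ij} = 0$, $i < j$, and $t_{ii} \ge 0$. If
the density of $Y$ is $g(Y'Y)$, then the density of $T$ is

(5)    $\prod_{i=1}^{p} \left\{ \frac{2\pi^{\frac{1}{2}(N+1-i)}}{\Gamma[\tfrac{1}{2}(N + 1 - i)]} \, t_{ii}^{N-i} \right\} g(TT') = \frac{2^p \pi^{\frac{1}{2}Np}}{\Gamma_p(\tfrac{1}{2}N)} \prod_{i=1}^{p} t_{ii}^{N-i} \, g(TT').$

Proof. Let $Y = (v_1, \ldots, v_p)$. Define $w_i$ and $u_i$ recursively by $w_1 = v_1$, $u_1 =
w_1/\|w_1\|$,

(6)    $w_i = v_i - \sum_{j=1}^{i-1} \frac{v_i'w_j}{\|w_j\|^2} w_j = v_i - \sum_{j=1}^{i-1} (v_i'u_j) u_j,$

and $u_i = w_i/\|w_i\|$. Then $w_i'w_j = 0$, $u_i'u_j = 0$, $i \ne j$, and $u_i'u_i = 1$. Conditional on
$v_1, \ldots, v_{i-1}$ (that is, $w_1, \ldots, w_{i-1}$), let $Q_i$ be an orthogonal matrix with
$u_1', \ldots, u_{i-1}'$ as the first $i - 1$ rows; that is,

(7)    $Q_i' = (u_1, \ldots, u_{i-1}, Q_i^{*\prime}).$

(See Lemma A.4.2.) Define

(8)    $z_i = Q_i v_i = \begin{pmatrix} t_{i1} \\ \vdots \\ t_{i,i-1} \\ z_i^* \end{pmatrix}.$

This transformation of $v_i$ is linear and has Jacobian 1. The vector $z_i^*$ has
$N + 1 - i$ components. Note that $\|z_i^*\|^2 = \|w_i\|^2$,

(9)    $v_i = \sum_{j=1}^{i-1} t_{ij} u_j + w_i = \sum_{j=1}^{i-1} t_{ij} u_j + Q_i^{*\prime} z_i^*,$

(10)    $v_i'v_i = \sum_{j=1}^{i-1} t_{ij}^2 + w_i'w_i = \sum_{j=1}^{i} t_{ij}^2,$

(11)    $v_i'v_j = \sum_{k=1}^{j} t_{jk} t_{ik}, \qquad j < i.$

The transformation from $Y = (v_1, \ldots, v_p)$ to $z_1, \ldots, z_p$ has Jacobian 1.
To obtain the density of $T$ convert $z_i^*$ to polar coordinates and integrate
with respect to the angular coordinates. (See Section 2.7.1.) ∎


The above proof follows the lines of the proof of (6) in Section 7.2, but
does not use information about the normal distribution, such as $t_{ii}^2 \overset{d}{=} \chi^2_{N+1-i}$.

See also Fang and Zhang (1990), Theorem 3.4.1.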
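The recursive construction in the proof of Theorem 7.9.2 is a Gram-Schmidt orthogonalization, and it yields the lower triangular factor $T$ with $Y'Y = TT'$ directly. The sketch below is illustrative, with assumed names and toy data; it takes $t_{ij} = v_i'u_j$ for $j < i$ and $t_{ii} = \|w_i\|$, and assumes the columns of $Y$ are linearly independent.

```python
import math

def triangular_factor(cols):
    """Given the columns v_1..v_p of Y, return lower triangular T with Y'Y = TT'.

    t_ij = v_i'u_j for j < i and t_ii = ||w_i||, where w_i is v_i minus its
    projection on u_1..u_{i-1} and u_i = w_i / ||w_i||.
    """
    p = len(cols)
    us = []
    t = [[0.0] * p for _ in range(p)]
    for i, v in enumerate(cols):
        w = list(v)
        for j, u in enumerate(us):
            t[i][j] = sum(a * b for a, b in zip(v, u))
            w = [wk - t[i][j] * uk for wk, uk in zip(w, u)]
        t[i][i] = math.sqrt(sum(a * a for a in w))
        us.append([a / t[i][i] for a in w])
    return t

cols = [[1.0, 2.0, 2.0, 0.0], [3.0, 0.0, 4.0, 1.0], [1.0, 1.0, 0.0, 2.0]]  # N = 4, p = 3
t = triangular_factor(cols)
gram = [[sum(a * b for a, b in zip(u, v)) for v in cols] for u in cols]      # Y'Y
ttt = [[sum(t[i][k] * t[j][k] for k in range(3)) for j in range(3)] for i in range(3)]
```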
Let С be a lower triangular matrix such that А = СС’. Define X = ҮС”. 


Theorem 7.9.3. Jf X (N Xp) has the density 


(12) Ici "e| c^x'x(c) |], 
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then the lower triangular matrix Т* satisfying X'X = T* T*' and t > 0 has the 
density 


2 Papp 


(13) [le "tele rr (с). 


| E,(GN)IAP i-i 
Let A-X'X- T*T*'. 
Theorem 7.9.4. If X has the density (12), then A = X' X has the density 


qp o 0I 


14 —_ А АР C^!A C' -1 . 
(14) ГМА" [ce] 


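The class of densities $g(\operatorname{tr} Y'Y)$ is a subclass of densities $g(Y'Y)$. Let
$X = \varepsilon_N \nu' + YC'$. Then the density of $X$ is

(15)  $|\Lambda|^{-\frac{1}{2}N} g[\operatorname{tr}(X - \varepsilon_N \nu') \Lambda^{-1} (X - \varepsilon_N \nu')'].$

A stochastic representation of $X$ is $\operatorname{vec} X \overset{d}{=} R(C \otimes I_N)\operatorname{vec} U + \nu \otimes \varepsilon_N$. Theorems
7.9.3 and 7.9.4 can be specialized to this form. Then Theorem 3.6.5
holds.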


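Theorem 7.9.5. Let $X$ have the density (12) where $\Lambda$ is diagonal. Let
$S = (N-1)^{-1}(X - \varepsilon_N \bar{x}')'(X - \varepsilon_N \bar{x}')$ and $R = (\operatorname{diag} S)^{-\frac{1}{2}} S (\operatorname{diag} S)^{-\frac{1}{2}}$. Then
the density of $R$ is (9) of Section 7.6.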

PROBLEMS 
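7.1. (Sec. 7.2) A transformation from rectangular to polar coordinates is

$y_1 = w \sin\theta_1,$
$y_2 = w \cos\theta_1 \sin\theta_2,$
$y_3 = w \cos\theta_1 \cos\theta_2 \sin\theta_3,$
$\vdots$
$y_{n-1} = w \cos\theta_1 \cos\theta_2 \cdots \cos\theta_{n-2} \sin\theta_{n-1},$
$y_n = w \cos\theta_1 \cos\theta_2 \cdots \cos\theta_{n-2} \cos\theta_{n-1},$

where $-\frac{1}{2}\pi < \theta_i \le \frac{1}{2}\pi$, $i = 1, \ldots, n-2$, $-\pi < \theta_{n-1} \le \pi$, and $0 \le w < \infty$.

(a) Prove $w^2 = \sum y_i^2$. [Hint: Compute in turn $y_n^2 + y_{n-1}^2$, $(y_n^2 + y_{n-1}^2) + y_{n-2}^2$, and
so forth.]

(b) Show that the Jacobian is $w^{n-1} \cos^{n-2}\theta_1 \cos^{n-3}\theta_2 \cdots \cos\theta_{n-2}$. [Hint: Prove

$\left| \frac{\partial(y_1, \ldots, y_n)}{\partial(\theta_1, \ldots, \theta_{n-1}, w)} \right| \cdot \begin{vmatrix} \cos\theta_1 & 0 & \cdots & 0 & 0 \\ 0 & \cos\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & \cos\theta_{n-1} & 0 \\ w\sin\theta_1 & w\sin\theta_2 & \cdots & w\sin\theta_{n-1} & 1 \end{vmatrix}$

$= \begin{vmatrix} w & x & \cdots & x & x \\ 0 & w\cos\theta_1 & \cdots & x & x \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & w\cos\theta_1 \cdots \cos\theta_{n-2} & x \\ 0 & 0 & \cdots & 0 & \cos\theta_1 \cdots \cos\theta_{n-1} \end{vmatrix},$

where x denotes elements whose explicit values are not needed.]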
7.2. (Sec. 7.2) Prove that

$\int_{-\pi/2}^{\pi/2} \cos^{h-1}\theta \, d\theta = \frac{\Gamma(\frac{1}{2}h)\,\Gamma(\frac{1}{2})}{\Gamma[\frac{1}{2}(h+1)]}.$

[Hint: Let $\cos^2\theta = u$, and use the definition of $B(p, q)$.]


7.3. (Sec. 7.2) Use Problems 7.1 and 7.2 to prove that the surface area of a sphere of
unit radius in n dimensions is

$C(n) = \frac{2\pi^{\frac{1}{2}n}}{\Gamma(\frac{1}{2}n)}.$


7.4. (Sec. 7.2) Use Problems 7.1, 7.2, and 7.3 to prove that if the density of
$y' = (y_1, \ldots, y_n)$ is $f(y'y)$, then the density of $u = y'y$ is $\frac{1}{2} C(n) f(u) u^{\frac{1}{2}n - 1}$.


7.5. (Sec. 7.2) $\chi^2$-distribution. Use Problem 7.4 to show that if $y_1, \ldots, y_n$ are
independently distributed, each according to $N(0, 1)$, then $U = \sum_{\alpha=1}^{n} y_\alpha^2$ has the
density $u^{\frac{1}{2}n - 1} e^{-\frac{1}{2}u} / [2^{\frac{1}{2}n} \Gamma(\frac{1}{2}n)]$, which is the $\chi^2$-density with n degrees of
freedom.


7.6. (Sec. 7.2) Use (9) of Section 7.6 to derive the distribution of А. 


7.7. (Sec. 7.2) Use the proof of Theorem 7.2.1 to demonstrate $\Pr\{|A| = 0\} = 0$.

7.8. (Sec. 7.2) Independence of estimators of the parameters of the complex normal
distribution. Let $z_1, \ldots, z_N$ be N observations from the complex normal distribution
with mean $\theta$ and covariance matrix $P$. (See Problem 2.64.) Show that $\bar{z}$
and $A = \sum_{\alpha=1}^{N} (z_\alpha - \bar{z})(z_\alpha - \bar{z})^*$ are independently distributed, and show that
$A$ has the distribution of $\sum_{\alpha=1}^{n} W_\alpha W_\alpha^*$, where $W_1, \ldots, W_n$ are independently
distributed, each according to the complex normal distribution with mean 0 and
covariance matrix $P$.


7.9. (Sec. 7.2) The complex Wishart distribution. Let $W_1, \ldots, W_n$ be independently
distributed, each according to the complex normal distribution with mean 0 and
covariance matrix $P$. (See Problem 2.64.) Show that the density of $B = \sum_{\alpha=1}^{n} W_\alpha W_\alpha^*$ is

$\frac{|B|^{n-p}\, e^{-\operatorname{tr} P^{-1}B}}{|P|^{n}\, \pi^{\frac{1}{2}p(p-1)} \prod_{i=1}^{p} \Gamma(n+1-i)}.$


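7.10. (Sec. 7.3) Find the characteristic function of $A$ from $W(\Sigma, n)$. [Hint: From
$\int w(A \mid \Sigma, n)\, dA = 1$, one derives

$\int |A|^{\frac{1}{2}(n-p-1)} \exp(-\tfrac{1}{2}\operatorname{tr} \Phi^{-1}A)\, dA = 2^{\frac{1}{2}np}\, \Gamma_p(\tfrac{1}{2}n)\, |\Phi|^{\frac{1}{2}n}$

as an identity in $\Phi$.] Note that comparison of this result with that of Section
7.3.1 is a proof of the Wishart distribution.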


7.11. (Sec. 7.3.2) Prove Theorem 7.3.2 by use of characteristic functions. 


7.12. (Sec. 7.3.1) Find the first two moments of the elements of А by differentiating 
the characteristic function (11). 


7.13. (Sec. 7.3) Let $Z_1, \ldots, Z_n$ be independently distributed, each according to
$N(0, I)$. Let $W = \sum_{\alpha,\beta=1}^{n} b_{\alpha\beta} Z_\alpha Z_\beta'$. Prove that if $a'Wa = \chi_m^2$ for all $a$ such that
$a'a = 1$, then $W$ is distributed according to $W(I, m)$. [Hint: Use the characteristic
function of $a'Wa$.]


7.14. (Sec. 7.4) Let $x_\alpha$ be an observation from $N(\beta z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, where $z_\alpha$ is
a scalar. Let $b = \sum_\alpha z_\alpha x_\alpha / \sum_\alpha z_\alpha^2$. Use Theorem 7.4.1 to show that
$\sum_\alpha x_\alpha x_\alpha' - bb' \sum_\alpha z_\alpha^2$ and $bb'$ are independent.

7.15. (Sec. 7.4) Show that

$\mathcal{E}(\chi^2_{N-1}\, \chi^2_{N-2})^h = \mathcal{E}[(\chi^2_{2N-4})^2/4]^h, \qquad h \ge 0,$

by use of the duplication formula for the gamma function; $\chi^2_{N-1}$ and $\chi^2_{N-2}$ are
independent. Hence show that the distribution of $\chi^2_{N-1}\chi^2_{N-2}$ is the distribution
of $(\chi^2_{2N-4})^2/4$.
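7.16. (Sec. 7.4) Verify that Theorem 7.4.1 follows from Lemma 7.4.1. [Hint: Prove
that $Q_i$ having the distribution $W(\Sigma, r_i)$ implies the existence of (6) where $I$ is
of order $r_i$ and that the independence of the $Q_i$'s implies that the $I$'s in (6) do
not overlap.]


7.17. (Sec. 7.5) Find $\mathcal{E}|A|^h$ directly from $W(\Sigma, n)$. [Hint: The fact that

$\int w(A \mid \Sigma, n)\, dA \equiv 1$

shows

$\int |A|^{\frac{1}{2}(n-p-1)} \exp(-\tfrac{1}{2}\operatorname{tr} \Sigma^{-1}A)\, dA \equiv 2^{\frac{1}{2}np}\, |\Sigma|^{\frac{1}{2}n}\, \Gamma_p(\tfrac{1}{2}n)$

as an identity in n.]


7.18. (Sec. 7.5) Consider the confidence region for $\mu$ given by

$N(\bar{x} - \mu^*)' S^{-1} (\bar{x} - \mu^*) \le \frac{(N-1)p}{N-p} F_{p,N-p}(\varepsilon),$

where $\bar{x}$ and $S$ are based on a sample of N from $N(\mu, \Sigma)$. Find the expected
value of the volume of the confidence region.


7.19. (Sec. 7.6) Prove that if $\Sigma = I$, the joint density of $r_{ij \cdot p}$, $i, j = 1, \ldots, p-1$, and
$r_{1p}, \ldots, r_{p-1,p}$ is

$\frac{\Gamma^{p-1}[\frac{1}{2}(n-1)]\, |R_{11 \cdot p}|^{\frac{1}{2}(n-p-1)}}{\pi^{\frac{1}{4}(p-1)(p-2)} \prod_{i=1}^{p-1} \Gamma[\frac{1}{2}(n-i)]} \prod_{i=1}^{p-1} \frac{\Gamma(\frac{1}{2}n)}{\pi^{\frac{1}{2}}\, \Gamma[\frac{1}{2}(n-1)]} (1 - r_{ip}^2)^{\frac{1}{2}(n-3)},$

where $R_{11 \cdot p} = (r_{ij \cdot p})$. [Hint: $r_{ij} = r_{ij \cdot p} \sqrt{1 - r_{ip}^2}\, \sqrt{1 - r_{jp}^2} + r_{ip} r_{jp}$ and
$|r_{ij}| = |r_{ij \cdot p}| \prod_{i=1}^{p-1} (1 - r_{ip}^2)$. Use (9) of Section 7.6.]


7.20. (Sec. 7.6) Prove that the joint density of $r_{12 \cdot 3, \ldots, p}$, $r_{13 \cdot 4, \ldots, p}$, $r_{23 \cdot 4, \ldots, p}$, $\ldots$,
$r_{1p}, \ldots, r_{p-1,p}$ is

$\prod_{j=2}^{p} \prod_{i=1}^{j-1} \frac{\Gamma\{\frac{1}{2}[n - (p-j)]\}}{\pi^{\frac{1}{2}}\, \Gamma\{\frac{1}{2}[n - (p-j) - 1]\}} \left(1 - r^2_{ij \cdot j+1, \ldots, p}\right)^{\frac{1}{2}[n - (p-j) - 3]}.$

[Hint: Use the result of Problem 7.19 inductively.]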
7.21. (Sec. 7.6) Prove (without the use of Problem 7.20) that if $\Sigma = I$, then
$r_{1p}, \ldots, r_{p-1,p}$ are independently distributed. [Hint: $r_{ip} = a_{ip}/(\sqrt{a_{ii}} \sqrt{a_{pp}})$. Prove
that the pairs $(a_{1p}, a_{11}), \ldots, (a_{p-1,p}, a_{p-1,p-1})$ are independent when
$(z_{1p}, \ldots, z_{Np})$ are fixed, and note from Section 4.2.1 that the marginal distribution
of $r_{ip}$, conditional on $z_{\alpha p}$, does not depend on $z_{\alpha p}$.]


7.22. (Sec. 7.6) Prove (without the use of Problems 7.19 and 7.20) that if $\Sigma = I$, then
the set $r_{1p}, \ldots, r_{p-1,p}$ is independent of the set $r_{ij \cdot p}$, $i, j = 1, \ldots, p-1$. [Hint:
From Section 4.3.2 $a_{pp}$ and $(a_{ip})$ are independent of $(a_{ij \cdot p})$. Prove that
$a_{pp}$, $(a_{ip})$, and $a_{ii \cdot p}$, $i = 1, \ldots, p-1$, are independent of $(r_{ij \cdot p})$ by proving that
$a_{ii \cdot p}$ are independent of $(r_{ij \cdot p})$. See Problem 4.21.]


7.23. (Sec. 7.6) Prove the conclusion of Problem 7.20 by using Problems 7.21 and
7.22.


7.24. (Sec. 7.6) Reverse the steps in Problem 7.20 to derive (9) of Section 7.6.


7.25. (Sec. 7.6) Show that when $p = 3$ and $\Sigma$ is diagonal $r_{12}, r_{13}, r_{23}$ are not
mutually independent.


7.26. (Sec. 7.6) Show that when $\Sigma$ is diagonal the set $r_{ij}$ are pairwise independent.


7.27. (Sec. 7.7) Multivariate t-distribution. Let $y$ and $u$ be independently distributed
according to $N(0, \Sigma)$ and the $\chi_n^2$-distribution, respectively, and let $\sqrt{n/u}\, y = x - \mu$.

(a) Show that the density of $x$ is

$\frac{\Gamma[\frac{1}{2}(n+p)]}{\Gamma(\frac{1}{2}n)\, n^{\frac{1}{2}p}\, \pi^{\frac{1}{2}p}\, |\Sigma|^{\frac{1}{2}}} \left[ 1 + \frac{1}{n}(x - \mu)' \Sigma^{-1} (x - \mu) \right]^{-\frac{1}{2}(n+p)}.$

(b) Show that $\mathcal{E}x = \mu$ and

$\mathcal{E}(x - \mu)(x - \mu)' = \frac{n}{n-2}\, \Sigma.$


7.28. (Sec. 7.8) Prove that Fe is not proportional to f by calculating Fe.


7.29. (Sec. 7.8) Prove for $p = 2$

$TDT' = d_1 A + (d_2 - d_1) \begin{pmatrix} 0 & 0 \\ 0 & |A|/a_{11} \end{pmatrix}.$


7.30. (Sec. 7.8) Verify (17) and (18). [Hint: To verify (18) let $\Sigma = KK'$, $A = KA^*K'$,
and $A^* = T^*T^{*\prime}$, where $K$ and $T^*$ are lower triangular.]
7.31. (Sec. 7.8) Prove for optimal D


Hi p-2i41 j p even 
&L,U,8) - é Lj(I,TDT') = - Y log} 1 — (ea ; , 


Mp-1) А 2 
-- у ghi - (£222) | p odd. 


i=l 


7.32. (Sec. 7.8) Prove $L_q(\Sigma, G)$ and $L_l(\Sigma, G)$ are invariant with respect to transformations
$G^* = CGC'$, $\Sigma^* = C\Sigma C'$ for $C$ nonsingular.


7.33. (Sec. 7.8) Prove $L_q(\Sigma, G)$ is a multiple of $(g - \sigma)' \Phi^{-1} (g - \sigma)$. Hint: Transform
so $\Sigma = I$. Then show


7.34. (Sec. 7.8) Verify (11). 


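7.35. Let the density of $Y$ be $f(y) = K$ for $y'y \le p + 2$ and 0 elsewhere. Prove that
$K = \Gamma(\frac{1}{2}p + 1)/[(p + 2)\pi]^{\frac{1}{2}p}$, and show that $\mathcal{E}Y = 0$ and $\mathcal{E}YY' = I$.


7.36. (Sec. 7.2) Dirichlet distribution. Let $Y_1, \ldots, Y_m$ be independently distributed as
$\chi^2$-variables with $p_1, \ldots, p_m$ degrees of freedom, respectively. Define
$Z_i = Y_i / \sum_{j=1}^{m} Y_j$, $i = 1, \ldots, m$. Show that the density of $Z_1, \ldots, Z_{m-1}$ is

$\frac{\Gamma(\frac{1}{2}\sum_{i=1}^{m} p_i)}{\prod_{i=1}^{m} \Gamma(\frac{1}{2}p_i)}\, z_1^{\frac{1}{2}p_1 - 1} \cdots z_m^{\frac{1}{2}p_m - 1}, \qquad z_m = 1 - \sum_{i=1}^{m-1} z_i,$

for $z_i \ge 0$, $i = 1, \ldots, m$.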

7.37. (Sec. 7.5) Show that if ху. and ХА are independently distributed, then 
X& (XS 2 is distributed as ( Хм a »?/4. [ Hint: In the joint density of x= XN 
and y= y<_. substitute z = 2/xy , X =x, and express the marginal density of z 
as 277 Cz). where A(z) is an integral with respect to x. Find h'Cz), and solve 
the differential equation. See. Srivastava and Khatri (1979), Chapter 3.] 


CHAPTER 8 


Testing the General Linear 
Hypothesis; Multivariate 
Analysis of Variance 


8.1. INTRODUCTION 


In this chapter we generalize the univariate least squares theory (i.e., regres- 
sion analysis) and the analysis of variance to vector variates. The algebra of 
the multivariate case is essentially the same as that of the univariate case. 
This leads to distribution theory that is analogous to that of the univariate 
case and to test criteria that are analogs of F-statistics. In fact, given a 
univariate test, we shall be able to write down immediately a corresponding 
multivariate test. Since the analysis of variance based on the model of fixed 
effects can be obtained from least squares theory, we obtain directly a theory 
of multivariate analysis of variance. However, in the multivariate case there is 
more latitude in the choice of tests of significance. 

In univariate least squares we consider scalar dependent variates $x_1, \ldots, x_N$
drawn from populations with expected values $\beta'z_1, \ldots, \beta'z_N$, respectively,
where $\beta$ is a column vector of q components and each of the $z_\alpha$ is a column
vector of q known components. Under the assumption that the variances in
the populations are the same, the least squares estimator of $\beta'$ is

(1)  $b' = \left( \sum_{\alpha=1}^{N} x_\alpha z_\alpha' \right) \left( \sum_{\alpha=1}^{N} z_\alpha z_\alpha' \right)^{-1}.$


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
If the populations are normal, the vector $b$ is the maximum likelihood estimator
of $\beta$. The unbiased estimator of the common variance $\sigma^2$ is

(2)  $s^2 = \sum_{\alpha=1}^{N} (x_\alpha - b'z_\alpha)^2 / (N - q),$

and under the assumption of normality, the maximum likelihood estimator of
$\sigma^2$ is $\hat\sigma^2 = (N - q)s^2/N$.

In the multivariate case $x_\alpha$ is a vector, $\beta'$ is replaced by a matrix $B$, and
$\sigma^2$ is replaced by a covariance matrix $\Sigma$. The estimators of $B$ and $\Sigma$, given
in Section 8.2, are matric analogs of (1) and (2).

To test a hypothesis concerning $\beta$, say the hypothesis $\beta = 0$, we use an
F-test. A criterion equivalent to the F-ratio is

(3)  $\frac{\hat\sigma^2}{\hat\sigma_0^2} = \frac{1}{[q/(N-q)]F + 1},$

where $\hat\sigma_0^2$ is the maximum likelihood estimator of $\sigma^2$ under the null
hypothesis. We shall find that the likelihood ratio criterion for the corresponding
multivariate hypothesis, say $B = 0$, is the above with the variances
replaced by generalized variances. The distribution of the likelihood ratio
criterion under the null hypothesis is characterized, the moments are found,
and some specific distributions obtained. Satisfactory approximations are
given as well as tables of significance points (Appendix B).
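
The equivalence between the F-ratio and the ratio of the two variance estimators in the univariate case can be checked numerically. The sketch below is illustrative only: the data are simulated, and the variable names (`sigma2_hat`, `sigma2_hat0`, `criterion`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, q = 30, 3
Z = rng.standard_normal((q, N))            # columns are the known vectors z_alpha
x = rng.standard_normal(N)                 # scalar observations x_alpha

b = np.linalg.solve(Z @ Z.T, Z @ x)        # least squares estimator of beta
rss = np.sum((x - b @ Z) ** 2)             # residual sum of squares
sigma2_hat = rss / N                       # MLE of sigma^2 in the full model
sigma2_hat0 = np.sum(x ** 2) / N           # MLE of sigma^2 under beta = 0

# Usual F-ratio for testing beta = 0 with q and N - q degrees of freedom.
F = ((np.sum(x ** 2) - rss) / q) / (rss / (N - q))
criterion = 1.0 / (q / (N - q) * F + 1.0)
```

Here `criterion` and `sigma2_hat / sigma2_hat0` should coincide up to rounding, since both equal the residual sum of squares divided by the total sum of squares.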

The hypothesis testing problem is invariant under several groups of linear 
transformations. Other invariant criteria are treated, including the 
Lawley—Hotelling trace, the Bartlett-Nanda-Pillai trace, and the Roy maxi- 
mum root criteria. Some comparison of power is made. 

Confidence regions or simultaneous confidence intervals for elements of B 
can be based on the likelihood ratio test, the Lawley—Hotelling trace test, 
and the Roy maximum root test. Procedures are given explicitly for several 
problems of the analysis of variance. Optimal properties of admissibility, 
unbiasedness, and monotonicity of power functions are studied. Finally, the
theory and methods are extended to elliptically contoured distributions. 


8.2. ESTIMATORS OF PARAMETERS IN MULTIVARIATE 
LINEAR REGRESSION 


8.2.1. Maximum Likelihood Estimators; Least Squares Estimators 


Suppose $x_1, \ldots, x_N$ are a set of N independent observations, $x_\alpha$ being drawn
from $N(Bz_\alpha, \Sigma)$. Ordinarily the vectors $z_\alpha$ (with q components) are known
vectors, and the $p \times p$ matrix $\Sigma$ and the $p \times q$ matrix $B$ are unknown. We
assume $N \ge p + q$ and the rank of

(1)  $Z = (z_1, \ldots, z_N)$

is q. We shall estimate $\Sigma$ and $B$ by the method of maximum likelihood. The
likelihood function is

(2)  $L = (2\pi)^{-\frac{1}{2}pN} |\Sigma^*|^{-\frac{1}{2}N} \exp\left[ -\tfrac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha - B^*z_\alpha)' \Sigma^{*-1} (x_\alpha - B^*z_\alpha) \right].$

In (2) the elements of $\Sigma^*$ and $B^*$ are indeterminates. The method of
maximum likelihood specifies the estimators of $\Sigma$ and $B$ based on the given
sample $x_1, z_1, \ldots, x_N, z_N$ as the $\Sigma^*$ and $B^*$ that maximize (2). It is convenient
to use the following lemma.


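Lemma 8.2.1. Let

(3)  $B = \sum_{\alpha=1}^{N} x_\alpha z_\alpha' \left( \sum_{\alpha=1}^{N} z_\alpha z_\alpha' \right)^{-1}.$

Then for any $p \times q$ matrix $F$

(4)  $\sum_{\alpha=1}^{N} (x_\alpha - Fz_\alpha)(x_\alpha - Fz_\alpha)' = \sum_{\alpha=1}^{N} (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' + (B - F) \sum_{\alpha=1}^{N} z_\alpha z_\alpha' (B - F)'.$

Proof. The left-hand side of (4) is

(5)  $\sum_{\alpha=1}^{N} [(x_\alpha - Bz_\alpha) + (B - F)z_\alpha][(x_\alpha - Bz_\alpha) + (B - F)z_\alpha]',$

which is equal to the right-hand side of (4) because

(6)  $\sum_{\alpha=1}^{N} z_\alpha (x_\alpha - Bz_\alpha)' = 0$

by virtue of (3). ∎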
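The exponential in $L$ is $-\frac{1}{2}$ times

(7)  $\operatorname{tr} \Sigma^{*-1} \sum_{\alpha=1}^{N} (x_\alpha - B^*z_\alpha)(x_\alpha - B^*z_\alpha)' = \operatorname{tr} \Sigma^{*-1} \sum_{\alpha=1}^{N} (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' + \operatorname{tr} \Sigma^{*-1} (B - B^*) A (B - B^*)',$

where

(8)  $A = \sum_{\alpha=1}^{N} z_\alpha z_\alpha'.$

The likelihood is maximized with respect to $B^*$ by minimizing the last term
in (7).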

Lemma 8.2.2. If $A$ and $G$ are positive definite, $\operatorname{tr} FAF'G > 0$ for $F \ne 0$.

Proof. Let $A = HH'$, $G = KK'$. Then

(9)  $\operatorname{tr} FAF'G = \operatorname{tr} FHH'F'KK' = \operatorname{tr} K'FHH'F'K = \operatorname{tr}(K'FH)(K'FH)' > 0$

for $F \ne 0$ because then $K'FH \ne 0$ since $H$ and $K$ are nonsingular. ∎


It follows from (7) and the lemma that $L$ is maximized with respect to $B^*$
by $B^* = B$, that is,

(10)  $\hat{B} = CA^{-1},$

where

(11)  $C = \sum_{\alpha=1}^{N} x_\alpha z_\alpha'.$

Then by Lemma 3.2.2, $L$ is maximized with respect to $\Sigma^*$ at

(12)  $\hat\Sigma = \frac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \hat{B}z_\alpha)(x_\alpha - \hat{B}z_\alpha)'.$

This is the multivariate analog of $\hat\sigma^2 = (N - q)s^2/N$ defined by (2) of
Section 8.1.


Theorem 8.2.1. If $x_\alpha$ is an observation from $N(Bz_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, with
$(z_1, \ldots, z_N)$ of rank q, the maximum likelihood estimator of $B$ is given by (10),
where $C = \sum_\alpha x_\alpha z_\alpha'$ and $A = \sum_\alpha z_\alpha z_\alpha'$. The maximum likelihood estimator of $\Sigma$
is given by (12).
A useful algebraic result follows from (12) and (4) with $F = 0$:

(13)  $N\hat\Sigma = \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - \hat{B}A\hat{B}' = \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - CA^{-1}C'.$


Now let us consider a geometric interpretation of the estimation procedure.
Let the ith row of $(x_1, \ldots, x_N)$ be $x_i^*$ (with N components) and the ith
row of $(z_1, \ldots, z_N)$ be $z_i^*$ (with N components). Then $\sum_j \hat\beta_{ij} z_j^*$, being a linear
combination of the vectors $z_1^*, \ldots, z_q^*$, is a vector in the q-space spanned by
$z_1^*, \ldots, z_q^*$, and is in fact, of all such vectors, the one nearest to $x_i^*$; hence, it
is the projection of $x_i^*$ on the q-space. Thus $x_i^* - \sum_j \hat\beta_{ij} z_j^*$ is the vector
orthogonal to the q-space going from the projection of $x_i^*$ on the q-space to
$x_i^*$. Translate this vector so that one endpoint is at the origin. Then the set of
p vectors $x_1^* - \sum_j \hat\beta_{1j} z_j^*, \ldots, x_p^* - \sum_j \hat\beta_{pj} z_j^*$ is a set of vectors emanating from
the origin. $N\hat\sigma_{ii} = (x_i^* - \sum_j \hat\beta_{ij} z_j^*)(x_i^* - \sum_j \hat\beta_{ij} z_j^*)'$ is the square of the length
of the ith such vector, and $N\hat\sigma_{ij} = (x_i^* - \sum_h \hat\beta_{ih} z_h^*)(x_j^* - \sum_g \hat\beta_{jg} z_g^*)'$ is the
product of the length of the ith vector, the length of the jth vector, and the
cosine of the angle between them.

The equations defining the maximum likelihood estimator of $B$, namely,
$A\hat{B}' = C'$, consist of p sets of q linear equations in q unknowns. Each set
can be solved by the method of pivotal condensation or successive elimination
(Section A.5 of the Appendix). The forward solutions are the same
(except the right-hand sides) for all sets. Use of (13) to compute $N\hat\Sigma$ involves
an efficient computation of $\hat{B}A\hat{B}'$.
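
As a rough illustration of these computations, the sketch below solves the normal equations for all p right-hand sides at once and compares the residual cross-product matrix with the expression in (13). It is an assumption-laden example with simulated data; `B_true`, `resid`, and the dimensions are hypothetical, and `numpy.linalg.solve` stands in for pivotal condensation.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, N = 2, 3, 40
B_true = rng.standard_normal((p, q))              # hypothetical true coefficients
Z = rng.standard_normal((q, N))                   # columns z_alpha
X = B_true @ Z + rng.standard_normal((p, N))      # columns x_alpha

A = Z @ Z.T                                       # A = sum of z z'
C = X @ Z.T                                       # C = sum of x z'
B_hat = np.linalg.solve(A, C.T).T                 # solves A B_hat' = C'
resid = X - B_hat @ Z
Sigma_hat = resid @ resid.T / N                   # estimator (12)
N_Sigma_via_13 = X @ X.T - B_hat @ A @ B_hat.T    # right-hand side of (13)
```

The residuals are orthogonal to the rows of Z, which is why the two ways of computing N times the covariance estimator agree.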

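Let $x_\alpha = (x_{1\alpha}, \ldots, x_{p\alpha})'$, $B = (\beta_1, \ldots, \beta_p)'$, and $\hat{B} = (b_1, \ldots, b_p)'$. Then
$\mathcal{E}x_{i\alpha} = \beta_i'z_\alpha$, and $b_i$ is the least squares estimator of $\beta_i$. If $G$ is a positive
definite matrix, then $\operatorname{tr} G \sum_{\alpha=1}^{N} (x_\alpha - Fz_\alpha)(x_\alpha - Fz_\alpha)'$ is minimized by $F = \hat{B}$.
This is another sense in which $\hat{B}$ is the least squares estimator.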


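8.2.2. Distribution of $\hat{B}$ and $\hat\Sigma$


Now let us find the joint distribution of $\hat\beta_{ig}$ $(i = 1, \ldots, p,\ g = 1, \ldots, q)$. The
joint distribution is normal since the $\hat\beta_{ig}$ are linear combinations of the $X_{i\alpha}$.
From (10) we see that

(14)  $\mathcal{E}\hat{B} = \mathcal{E} \sum_{\alpha=1}^{N} X_\alpha z_\alpha' A^{-1} = \sum_{\alpha=1}^{N} Bz_\alpha z_\alpha' A^{-1} = BAA^{-1} = B.$

Thus $\hat{B}$ is an unbiased estimator of $B$. The covariance between $\hat\beta_i'$ and $\hat\beta_j'$,
two rows of $\hat{B}$, is

(15)  $\mathcal{E}(\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j)' = A^{-1} \mathcal{E} \sum_{\alpha=1}^{N} (X_{i\alpha} - \mathcal{E}X_{i\alpha}) z_\alpha \sum_{\gamma=1}^{N} (X_{j\gamma} - \mathcal{E}X_{j\gamma}) z_\gamma' A^{-1}$
      $= A^{-1} \sum_{\alpha,\gamma=1}^{N} \mathcal{E}(X_{i\alpha} - \mathcal{E}X_{i\alpha})(X_{j\gamma} - \mathcal{E}X_{j\gamma})\, z_\alpha z_\gamma' A^{-1}$
      $= A^{-1} \sum_{\alpha,\gamma=1}^{N} \delta_{\alpha\gamma} \sigma_{ij}\, z_\alpha z_\gamma' A^{-1}$
      $= A^{-1} \sum_{\alpha=1}^{N} \sigma_{ij}\, z_\alpha z_\alpha' A^{-1}$
      $= \sigma_{ij} A^{-1} A A^{-1}$
      $= \sigma_{ij} A^{-1}.$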
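To summarize, the vector of pq components $(\hat\beta_1', \ldots, \hat\beta_p')' = \operatorname{vec} \hat{B}'$ is normally
distributed with mean $(\beta_1', \ldots, \beta_p')' = \operatorname{vec} B'$ and covariance matrix

(16)  $\begin{pmatrix} \sigma_{11}A^{-1} & \sigma_{12}A^{-1} & \cdots & \sigma_{1p}A^{-1} \\ \sigma_{21}A^{-1} & \sigma_{22}A^{-1} & \cdots & \sigma_{2p}A^{-1} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1}A^{-1} & \sigma_{p2}A^{-1} & \cdots & \sigma_{pp}A^{-1} \end{pmatrix}.$

The matrix (16) is the Kronecker (or direct) product of the matrices $\Sigma$ and
$A^{-1}$, denoted by $\Sigma \otimes A^{-1}$.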
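
The Kronecker structure can be confirmed without simulation by propagating the covariance of the observations through the linear map that produces the rows of the estimator. This is a sketch under assumed dimensions and an arbitrary positive definite covariance matrix; the names `L`, `map_to_vec`, and `cov_vec_B` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, N = 2, 3, 10
Z = rng.standard_normal((q, N))
A_inv = np.linalg.inv(Z @ Z.T)
M = rng.standard_normal((p, p))
Sigma = M @ M.T + np.eye(p)                    # arbitrary positive definite Sigma

# Row i of B_hat is L times the i-th row of X (as a column), with L = A^{-1} Z.
L = A_inv @ Z
# Stacking the rows of X gives a vector with covariance Sigma (x) I_N,
# because the columns x_alpha are independent with covariance Sigma.
cov_rows_of_X = np.kron(Sigma, np.eye(N))
map_to_vec = np.kron(np.eye(p), L)             # stacked rows of X -> vec B_hat'
cov_vec_B = map_to_vec @ cov_rows_of_X @ map_to_vec.T
```

The (i, j) block of `cov_vec_B` is then the (i, j) element of Sigma times the inverse of A, matching the block layout of (16).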

From Theorem 4.3.3 it follows that $N\hat\Sigma = \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - \hat{B}A\hat{B}'$ is distributed
according to $W(\Sigma, N - q)$. From this we see that an unbiased
estimator of $\Sigma$ is $S = [N/(N - q)]\hat\Sigma$.


Theorem 8.2.2. The maximum likelihood estimator $\hat{B}$ based on a set of N
observations, the $\alpha$th from $N(Bz_\alpha, \Sigma)$, is normally distributed with mean $B$, and
the covariance matrix of the ith and jth rows of $\hat{B}$ is $\sigma_{ij}A^{-1}$, where $A = \sum_\alpha z_\alpha z_\alpha'$.
The maximum likelihood estimator $\hat\Sigma$ multiplied by N is independently distributed
according to $W(\Sigma, N - q)$, where q is the number of components of $z_\alpha$.

The density then can be written [by virtue of (4)]

(17)  $\frac{1}{(2\pi)^{\frac{1}{2}pN} |\Sigma|^{\frac{1}{2}N}} \exp\left\{ -\tfrac{1}{2} \operatorname{tr} \Sigma^{-1} \left[ (\hat{B} - B) A (\hat{B} - B)' + N\hat\Sigma \right] \right\}.$

This proves the following:

Corollary 8.2.1. $\hat{B}$ and $\hat\Sigma$ form a sufficient set of statistics for $B$ and $\Sigma$.

A useful theorem is the following.


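Theorem 8.2.3. Let $X_\alpha$ be distributed according to $N(Bz_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$,
and suppose $X_1, \ldots, X_N$ are independent.

(a) If $w_\alpha = Hz_\alpha$ and $\Gamma = BH^{-1}$, then $X_\alpha$ is distributed according to
$N(\Gamma w_\alpha, \Sigma)$.

(b) The maximum likelihood estimator of $\Gamma$ based on observations $x_\alpha$ on $X_\alpha$,
$\alpha = 1, \ldots, N$, is $\hat\Gamma = \hat{B}H^{-1}$, where $\hat{B}$ is the maximum likelihood estimator
of $B$.

(c) $\hat\Gamma (\sum_\alpha w_\alpha w_\alpha') \hat\Gamma' = \hat{B}A\hat{B}'$, where $A = \sum_\alpha z_\alpha z_\alpha'$, and the maximum likelihood
estimator of $N\Sigma$ is $N\hat\Sigma = \sum_\alpha x_\alpha x_\alpha' - \hat\Gamma (\sum_\alpha w_\alpha w_\alpha') \hat\Gamma' = \sum_\alpha x_\alpha x_\alpha' - \hat{B}A\hat{B}'$.

(d) $\hat\Gamma$ and $\hat\Sigma$ are independently distributed.

(e) $\hat\Gamma$ is normally distributed with mean $\Gamma$ and the covariance matrix of the
ith and jth rows of $\hat\Gamma$ is $\sigma_{ij}(HAH')^{-1} = \sigma_{ij} H'^{-1} A^{-1} H^{-1}$.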


The proof is left to the reader. 
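An estimator $F$ is a linear estimator of $\beta_{ig}$ if $F = \sum_{\alpha=1}^{N} f_\alpha' x_\alpha$. It is a linear
unbiased estimator of $\beta_{ig}$ if

(18)  $\beta_{ig} = \mathcal{E}F = \mathcal{E} \sum_{\alpha=1}^{N} f_\alpha' x_\alpha = \sum_{\alpha=1}^{N} f_\alpha' Bz_\alpha = \sum_{\alpha=1}^{N} \sum_{j=1}^{p} \sum_{h=1}^{q} f_{j\alpha} \beta_{jh} z_{h\alpha}$

is an identity in $B$, that is, if

(19)  $\sum_{\alpha=1}^{N} f_{j\alpha} z_{h\alpha} = 1$ for $j = i$, $h = g$, and $= 0$ otherwise.

A linear unbiased estimator is best if it has minimum variance over all linear
unbiased estimators; that is, if $\mathcal{E}(F - \beta_{ig})^2 \le \mathcal{E}(G - \beta_{ig})^2$ for $G = \sum_{\alpha=1}^{N} g_\alpha' x_\alpha$
and $\mathcal{E}G = \beta_{ig}$.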
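Theorem 8.2.4. The least squares estimator is the best linear unbiased
estimator of $\beta_{ig}$.


Proof. Let $\tilde\beta_{ig} = \sum_{\alpha=1}^{N} \sum_{j=1}^{p} f_{j\alpha} x_{j\alpha}$ be an arbitrary unbiased estimator of
$\beta_{ig}$, and let $\hat\beta_{ig} = \sum_{\alpha=1}^{N} \sum_{h=1}^{q} x_{i\alpha} z_{h\alpha} a^{hg}$ be the least squares estimator, where
$A^{-1} = (a^{hg})$ and $A = \sum_{\alpha=1}^{N} z_\alpha z_\alpha' = (a_{hk})$. Then

(20)  $\mathcal{E}(\tilde\beta_{ig} - \beta_{ig})^2 = \mathcal{E}[(\hat\beta_{ig} - \beta_{ig}) + (\tilde\beta_{ig} - \hat\beta_{ig})]^2$
      $= \mathcal{E}(\hat\beta_{ig} - \beta_{ig})^2 + 2\mathcal{E}(\hat\beta_{ig} - \beta_{ig})(\tilde\beta_{ig} - \hat\beta_{ig}) + \mathcal{E}(\tilde\beta_{ig} - \hat\beta_{ig})^2.$

Write $u_{j\alpha} = x_{j\alpha} - \mathcal{E}x_{j\alpha}$. Because $\tilde\beta_{ig}$ and $\hat\beta_{ig}$ are unbiased,
$\tilde\beta_{ig} - \beta_{ig} = \sum_{\alpha=1}^{N} \sum_{j=1}^{p} f_{j\alpha} u_{j\alpha}$, $\hat\beta_{ig} - \beta_{ig} = \sum_{\alpha=1}^{N} \sum_{h=1}^{q} u_{i\alpha} z_{h\alpha} a^{hg}$, and

(21)  $\tilde\beta_{ig} - \hat\beta_{ig} = \sum_{\alpha=1}^{N} \sum_{j=1}^{p} \left( f_{j\alpha} - \delta_{ij} \sum_{h=1}^{q} z_{h\alpha} a^{hg} \right) u_{j\alpha},$

where $\delta_{ii} = 1$ and $\delta_{ij} = 0$, $i \ne j$. Then, since $\mathcal{E}u_{i\alpha}u_{j\gamma} = \delta_{\alpha\gamma}\sigma_{ij}$,

(22)  $\mathcal{E}(\hat\beta_{ig} - \beta_{ig})(\tilde\beta_{ig} - \hat\beta_{ig}) = \sum_{\alpha=1}^{N} \sum_{h=1}^{q} \sum_{j=1}^{p} z_{h\alpha} a^{hg} \sigma_{ij} \left( f_{j\alpha} - \delta_{ij} \sum_{k=1}^{q} z_{k\alpha} a^{kg} \right)$
      $= \sum_{j=1}^{p} \sigma_{ij} \sum_{h=1}^{q} a^{hg} \sum_{\alpha=1}^{N} f_{j\alpha} z_{h\alpha} - \sigma_{ii} \sum_{h,k=1}^{q} a^{hg} a_{hk} a^{kg}$
      $= \sigma_{ii} a^{gg} - \sigma_{ii} a^{gg} = 0$

by (19). Then (20) implies $\mathcal{E}(\tilde\beta_{ig} - \beta_{ig})^2 \ge \mathcal{E}(\hat\beta_{ig} - \beta_{ig})^2$. ∎

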
8.3. LIKELIHOOD RATIO CRITERIA FOR TESTING LINEAR
HYPOTHESES ABOUT REGRESSION COEFFICIENTS


8.3.1. Likelihood Ratio Criteria 


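Suppose we partition

(1)  $B = (B_1 \ B_2)$

so that $B_1$ has $q_1$ columns and $B_2$ has $q_2$ columns. We shall derive the
likelihood ratio criterion for testing the hypothesis

(2)  $H: B_1 = B_1^*,$

where $B_1^*$ is a given matrix. The maximum of the likelihood function $L$ for
the sample $x_1, \ldots, x_N$ is

(3)  $\max_{B, \Sigma} L = (2\pi)^{-\frac{1}{2}pN} |\hat\Sigma_\Omega|^{-\frac{1}{2}N} e^{-\frac{1}{2}pN},$

where $\hat\Sigma_\Omega$ is given by (12) or (13) of Section 8.2.
To find the maximum of the likelihood function for the parameters
restricted to $\omega$ defined by (2) we let

(4)  $y_\alpha = x_\alpha - B_1^* z_\alpha^{(1)}, \qquad \alpha = 1, \ldots, N,$

where

(5)  $z_\alpha = \begin{pmatrix} z_\alpha^{(1)} \\ z_\alpha^{(2)} \end{pmatrix}, \qquad \alpha = 1, \ldots, N,$

is partitioned in a manner corresponding to the partitioning of $B$. Then $y_\alpha$
can be considered as an observation from $N(B_2 z_\alpha^{(2)}, \Sigma)$. The estimator of $B_2$
is obtained by the procedure of Section 8.2 as

(6)  $\hat{B}_{2\omega} = \sum_{\alpha=1}^{N} y_\alpha z_\alpha^{(2)\prime} A_{22}^{-1} = \sum_{\alpha=1}^{N} (x_\alpha - B_1^* z_\alpha^{(1)}) z_\alpha^{(2)\prime} A_{22}^{-1} = (C_2 - B_1^* A_{12}) A_{22}^{-1}$

with $C$ and $A$ partitioned in the manner corresponding to the partitioning of
$B$ and $z_\alpha$,

(7)  $C = (C_1 \ C_2),$

(8)  $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$

The estimator of $\Sigma$ is given by

(9)  $N\hat\Sigma_\omega = \sum_{\alpha=1}^{N} (y_\alpha - \hat{B}_{2\omega} z_\alpha^{(2)})(y_\alpha - \hat{B}_{2\omega} z_\alpha^{(2)})'$
     $= \sum_{\alpha=1}^{N} y_\alpha y_\alpha' - \hat{B}_{2\omega} A_{22} \hat{B}_{2\omega}'$
     $= \sum_{\alpha=1}^{N} (x_\alpha - B_1^* z_\alpha^{(1)})(x_\alpha - B_1^* z_\alpha^{(1)})' - \hat{B}_{2\omega} A_{22} \hat{B}_{2\omega}'.$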
Thus the maximum of the likelihood function over $\omega$ is

(10)  $\max_{B_2, \Sigma} L = (2\pi)^{-\frac{1}{2}pN} |\hat\Sigma_\omega|^{-\frac{1}{2}N} e^{-\frac{1}{2}pN}.$

The likelihood ratio criterion for testing $H$ is (10) divided by (3), namely,

(11)  $\lambda = \frac{|\hat\Sigma_\Omega|^{\frac{1}{2}N}}{|\hat\Sigma_\omega|^{\frac{1}{2}N}}.$

In testing $H$, one rejects the hypothesis if $\lambda < \lambda_0$, where $\lambda_0$ is a suitably
chosen number.

A special case of this problem led to Hotelling's $T^2$-criterion. If $q = q_1 = 1$
$(q_2 = 0)$, $z_\alpha = 1$, $\alpha = 1, \ldots, N$, and $B = B_1 = \mu$, then the $T^2$-criterion for
testing the hypothesis $\mu = \mu_0$ is a monotonic function of (11) for $B_1^* = \mu_0$.

The hypothesis $\mu = 0$ and the $T^2$-statistic are invariant with respect to the
transformations $X^* = DX$ and $x_\alpha^* = Dx_\alpha$, $\alpha = 1, \ldots, N$, for nonsingular $D$.
Similarly, in this problem the null hypothesis $B_1 = 0$ and the likelihood ratio
criterion for testing it are invariant with respect to nonsingular linear
transformations.
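
The invariance can be illustrated numerically by computing the ratio of generalized variances before and after a nonsingular transformation of the observations. The helper `lr_statistic` below is a hypothetical name introduced for this sketch, and the data are simulated; it returns the ratio of determinants, of which the likelihood ratio criterion is the N/2 power.

```python
import numpy as np

def lr_statistic(X, Z1, Z2):
    """Ratio |Sigma_hat_Omega| / |Sigma_hat_omega| for the hypothesis B_1 = 0."""
    def residual_cross_products(Zm):
        B = np.linalg.solve(Zm @ Zm.T, Zm @ X.T).T   # least squares fit on rows of Zm
        R = X - B @ Zm
        return R @ R.T
    full = residual_cross_products(np.vstack([Z1, Z2]))   # N * Sigma_hat_Omega
    restricted = residual_cross_products(Z2)              # N * Sigma_hat_omega
    return np.linalg.det(full) / np.linalg.det(restricted)

rng = np.random.default_rng(4)
p, q1, q2, N = 3, 2, 2, 25
Z1 = rng.standard_normal((q1, N))
Z2 = rng.standard_normal((q2, N))
X = rng.standard_normal((p, N))
D = rng.standard_normal((p, p)) + 3 * np.eye(p)   # assumed nonsingular
U_original = lr_statistic(X, Z1, Z2)
U_transformed = lr_statistic(D @ X, Z1, Z2)
```

Both determinants pick up the same factor, the squared determinant of D, so the ratio is unchanged.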


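Theorem 8.3.1. The likelihood ratio criterion (11) for testing the null
hypothesis $B_1 = 0$ is invariant with respect to transformations $x_\alpha^* = Dx_\alpha$,
$\alpha = 1, \ldots, N$, for nonsingular $D$.


Proof. The estimators in terms of $x_\alpha^*$ are

(12)  $\hat{B}^* = DCA^{-1} = D\hat{B},$

(13)  $\hat\Sigma_\Omega^* = \frac{1}{N} \sum_{\alpha=1}^{N} (Dx_\alpha - D\hat{B}z_\alpha)(Dx_\alpha - D\hat{B}z_\alpha)' = D\hat\Sigma_\Omega D',$

(14)  $\hat{B}_{2\omega}^* = DC_2 A_{22}^{-1} = D\hat{B}_{2\omega},$

(15)  $\hat\Sigma_\omega^* = \frac{1}{N} \sum_{\alpha=1}^{N} (Dx_\alpha - D\hat{B}_{2\omega} z_\alpha^{(2)})(Dx_\alpha - D\hat{B}_{2\omega} z_\alpha^{(2)})' = D\hat\Sigma_\omega D'.$ ∎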


8.3.2. Geometric Interpretation 


An insight into the algebra developed here can be given in terms of a 
geometric interpretation. It will be convenient to use the following lemma: 


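Lemma 8.3.1.

(16)  $\hat{B}_{2\omega} - \hat{B}_{2\Omega} = (\hat{B}_{1\Omega} - B_1^*) A_{12} A_{22}^{-1}.$

Proof. The normal equation $\hat{B}_\Omega A = C$ is written in partitioned form

(17)  $(\hat{B}_{1\Omega} A_{11} + \hat{B}_{2\Omega} A_{21},\ \hat{B}_{1\Omega} A_{12} + \hat{B}_{2\Omega} A_{22}) = (C_1, C_2).$

Thus $\hat{B}_{2\Omega} = C_2 A_{22}^{-1} - \hat{B}_{1\Omega} A_{12} A_{22}^{-1}$. The lemma follows by comparison with
(6). ∎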


We can now write

(18)  $X - BZ = (X - \hat{B}_\Omega Z) + (\hat{B}_{2\Omega} - B_2) Z_2 + (\hat{B}_{1\Omega} - B_1) Z_1$
      $= (X - \hat{B}_\Omega Z) + (\hat{B}_{2\omega} - B_2) Z_2 - (\hat{B}_{2\omega} - \hat{B}_{2\Omega}) Z_2 + (\hat{B}_{1\Omega} - B_1) Z_1$
      $= (X - \hat{B}_\Omega Z) + (\hat{B}_{2\omega} - B_2) Z_2 + (\hat{B}_{1\Omega} - B_1^*)(Z_1 - A_{12} A_{22}^{-1} Z_2)$

as an identity (the last step uses Lemma 8.3.1 with $B_1 = B_1^*$); here $X = (x_1, \ldots, x_N)$,
$Z_1 = (z_1^{(1)}, \ldots, z_N^{(1)})$, and $Z_2 = (z_1^{(2)}, \ldots, z_N^{(2)})$. The rows of $Z = (Z_1', Z_2')'$ span a q-dimensional subspace in
N-space. Each row of $BZ$ is a vector in the q-space, and hence each row of
$X - BZ$ is a vector from a vector in the q-space to the corresponding row
vector of $X$. Each row vector of $X - BZ$ is expressed above as the sum of
three row vectors. The first matrix on the right of (18) has as its ith row a
vector orthogonal to the q-space and leading to the ith row vector of $X$ (as
shown in the preceding section). The row vectors of $(\hat{B}_{2\omega} - B_2) Z_2$ are vectors
in the $q_2$-space spanned by the rows of $Z_2$ (since they are linear combinations
of the rows of $Z_2$). The row vectors of $(\hat{B}_{1\Omega} - B_1^*)(Z_1 - A_{12} A_{22}^{-1} Z_2)$ are
vectors in the $q_1$-space of $Z_1 - A_{12} A_{22}^{-1} Z_2$, and this space is in the q-space of
$Z$, but orthogonal to the $q_2$-space of $Z_2$ [since $(Z_1 - A_{12} A_{22}^{-1} Z_2) Z_2' = 0$]. Thus
each row of $X - BZ$ is indicated in Figure 8.1 as the sum of three orthogonal
vectors: one vector is in the space orthogonal to $Z$, one is in the space of $Z_2$,
and one is in the subspace of $Z$ that is orthogonal to $Z_2$.


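[Figure 8.1. Axes labeled $Z_1$ and $Z_2$; vectors labeled $BZ$, $(\hat{B}_{2\omega} - B_2) Z_2$, and $(\hat{B}_{1\Omega} - B_1^*)(Z_1 - A_{12} A_{22}^{-1} Z_2)$.]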
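From the orthogonality relations we have

(19)  $(X - BZ)(X - BZ)'$
      $= (X - \hat{B}_\Omega Z)(X - \hat{B}_\Omega Z)' + (\hat{B}_{2\omega} - B_2) Z_2 Z_2' (\hat{B}_{2\omega} - B_2)'$
      $\quad + (\hat{B}_{1\Omega} - B_1^*)(Z_1 - A_{12} A_{22}^{-1} Z_2)(Z_1 - A_{12} A_{22}^{-1} Z_2)' (\hat{B}_{1\Omega} - B_1^*)'$
      $= N\hat\Sigma_\Omega + (\hat{B}_{2\omega} - B_2) A_{22} (\hat{B}_{2\omega} - B_2)'$
      $\quad + (\hat{B}_{1\Omega} - B_1^*)(A_{11} - A_{12} A_{22}^{-1} A_{21})(\hat{B}_{1\Omega} - B_1^*)'.$

If we subtract $(\hat{B}_{2\omega} - B_2) Z_2$ from both sides of (18), we have

(20)  $X - B_1^* Z_1 - \hat{B}_{2\omega} Z_2 = (X - \hat{B}_\Omega Z) + (\hat{B}_{1\Omega} - B_1^*)(Z_1 - A_{12} A_{22}^{-1} Z_2).$

From this we obtain

(21)  $N\hat\Sigma_\omega = (X - B_1^* Z_1 - \hat{B}_{2\omega} Z_2)(X - B_1^* Z_1 - \hat{B}_{2\omega} Z_2)'$
      $= (X - \hat{B}_\Omega Z)(X - \hat{B}_\Omega Z)'$
      $\quad + (\hat{B}_{1\Omega} - B_1^*)(Z_1 - A_{12} A_{22}^{-1} Z_2)(Z_1 - A_{12} A_{22}^{-1} Z_2)' (\hat{B}_{1\Omega} - B_1^*)'$
      $= N\hat\Sigma_\Omega + (\hat{B}_{1\Omega} - B_1^*)(A_{11} - A_{12} A_{22}^{-1} A_{21})(\hat{B}_{1\Omega} - B_1^*)'.$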

The determinant $|\hat\Sigma_\Omega| = (1/N^p)|(X - \hat{B}_\Omega Z)(X - \hat{B}_\Omega Z)'|$ is proportional
to the volume squared of the parallelotope spanned by the row vectors of
$X - \hat{B}_\Omega Z$ (translated to the origin). The determinant
$|\hat\Sigma_\omega| = (1/N^p)|(X - B_1^* Z_1 - \hat{B}_{2\omega} Z_2)(X - B_1^* Z_1 - \hat{B}_{2\omega} Z_2)'|$ is proportional to the volume squared
of the parallelotope spanned by the row vectors of $X - B_1^* Z_1 - \hat{B}_{2\omega} Z_2$ (translated
to the origin); each of these vectors is the part of the vector of
$X - B_1^* Z_1$ that is orthogonal to $Z_2$. Thus the test based on the likelihood ratio
criterion depends on the ratio of volumes of parallelotopes. One parallelotope
involves vectors orthogonal to $Z$, and the other involves vectors orthogonal
to $Z_2$.

From (19) we see that the density of $x_1, \ldots, x_N$ can be written as

(22)  $(2\pi)^{-\frac{1}{2}pN} |\Sigma|^{-\frac{1}{2}N} \exp\left( -\tfrac{1}{2} \operatorname{tr} \Sigma^{-1} \left[ N\hat\Sigma_\Omega + (\hat{B}_{2\omega} - B_2) A_{22} (\hat{B}_{2\omega} - B_2)' + (\hat{B}_{1\Omega} - B_1^*)(A_{11} - A_{12} A_{22}^{-1} A_{21})(\hat{B}_{1\Omega} - B_1^*)' \right] \right).$

Thus, $\hat\Sigma_\Omega$, $\hat{B}_{1\Omega}$, and $\hat{B}_{2\omega}$ form a sufficient set of statistics for $\Sigma$, $B_1$, and $B_2$.
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Wilks (1932) first gave the likelihood ratio criterion for testing the equality 
of mean vectors from several populations (Section 8.8). Wilks (1934) and 
Bartlett (1934) extended its use to regression coefficients. 


8.3.3. The Canonical Form 


In studying the distributions of criteria it will be convenient to put the
distribution of the observations in canonical form. This amounts to picking a
coordinate system in the N-dimensional space so that the first $q_1$ coordinate
axes are in the space of $Z$ that is orthogonal to $Z_2$, the next $q_2$ coordinate
axes are in the space of $Z_2$, and the last $n$ $(= N - q)$ coordinate axes are
orthogonal to the Z-space.

Let $P_2$ be a $q_2 \times q_2$ matrix such that

(23)  $I = P_2 A_{22} P_2' = (P_2 Z_2)(P_2 Z_2)',$

and let $P_1$ be a $q_1 \times q_1$ matrix such that $(A_{11 \cdot 2} = A_{11} - A_{12} A_{22}^{-1} A_{21})$

(24)  $I = P_1 A_{11 \cdot 2} P_1' = [P_1(Z_1 - A_{12} A_{22}^{-1} Z_2)][P_1(Z_1 - A_{12} A_{22}^{-1} Z_2)]'.$

Then define the $N \times N$ orthogonal matrix $Q$ as

(25)  $Q = \begin{pmatrix} Q_1 \\ Q_2 \\ Q_3 \end{pmatrix} = \begin{pmatrix} P_1(Z_1 - A_{12} A_{22}^{-1} Z_2) \\ P_2 Z_2 \\ Q_3 \end{pmatrix},$

where $Q_3$ is any $n \times N$ matrix making $Q$ orthogonal. Then the columns of

(26)  $W = (W_1 \ W_2 \ W_3) = XQ' = X(Q_1' \ Q_2' \ Q_3')$

are independently normally distributed with covariance matrix $\Sigma$ (Theorem
3.3.1). Then

(27)  $\mathcal{E}W_1 = \mathcal{E}XQ_1' = (B_1 Z_1 + B_2 Z_2)(Z_1 - A_{12} A_{22}^{-1} Z_2)' P_1' = B_1 A_{11 \cdot 2} P_1' = B_1 P_1^{-1},$

(28)  $\mathcal{E}W_2 = \mathcal{E}XQ_2' = (B_1 Z_1 + B_2 Z_2) Z_2' P_2' = (B_1 A_{12} + B_2 A_{22}) P_2',$

(29)  $\mathcal{E}W_3 = \mathcal{E}XQ_3' = BZQ_3' = 0.$
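
One concrete way to build the matrices of the canonical form is sketched below. Taking each P as the inverse of a Cholesky factor and obtaining the last block of Q from a singular value decomposition are choices made for this illustration, not constructions prescribed by the text; the dimensions and data are likewise assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
q1, q2, N = 2, 3, 12
Z1 = rng.standard_normal((q1, N))
Z2 = rng.standard_normal((q2, N))
A11, A12, A22 = Z1 @ Z1.T, Z1 @ Z2.T, Z2 @ Z2.T
Z1_res = Z1 - A12 @ np.linalg.solve(A22, Z2)        # Z_1 - A_12 A_22^{-1} Z_2
A11_2 = A11 - A12 @ np.linalg.solve(A22, A12.T)     # A_{11.2}

# If A = L L' (Cholesky), then P = L^{-1} satisfies P A P' = I.
P1 = np.linalg.inv(np.linalg.cholesky(A11_2))
P2 = np.linalg.inv(np.linalg.cholesky(A22))
Q1, Q2 = P1 @ Z1_res, P2 @ Z2

# Q_3: an orthonormal basis for the orthogonal complement of the rows of Z.
_, _, Vt = np.linalg.svd(np.vstack([Q1, Q2]))
Q3 = Vt[q1 + q2:]
Q = np.vstack([Q1, Q2, Q3])
```

With this Q, the rows of Z are orthogonal to the rows of the last block, which is the property behind the zero expectation in (29).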
Let

(30)  $\Gamma_1 = (\gamma_1, \ldots, \gamma_{q_1}) = B_1 A_{11 \cdot 2} P_1' = B_1 P_1^{-1},$

(31)  $\Gamma_2 = (\gamma_{q_1+1}, \ldots, \gamma_q) = (B_1 A_{12} + B_2 A_{22}) P_2',$

(32)  $W = (W_1 \ W_2 \ W_3) = (w_1, \ldots, w_{q_1}, w_{q_1+1}, \ldots, w_q, w_{q+1}, \ldots, w_N).$

Then $w_1, \ldots, w_N$ are independently normally distributed with covariance
matrix $\Sigma$ and $\mathcal{E}w_\alpha = \gamma_\alpha$, $\alpha = 1, \ldots, q$, and $\mathcal{E}w_\alpha = 0$, $\alpha = q+1, \ldots, N$.

The hypothesis $B_1 = B_1^*$ can be transformed to $B_1 = 0$ by subtraction, that
is, by letting $x_\alpha - B_1^* z_\alpha^{(1)} = y_\alpha$, as in Section 8.3.1. In canonical form then, the
hypothesis is $\Gamma_1 = 0$. We can study problems in the canonical form, if we
wish, and transform solutions back to terms of $X$ and $Z$.

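In (17), which is the partitioned form of $\hat{B}_\Omega A = C$, eliminate $\hat{B}_{2\Omega}$ to obtain

(33)  $\hat{B}_{1\Omega}(A_{11} - A_{12} A_{22}^{-1} A_{21}) = C_1 - C_2 A_{22}^{-1} A_{21}$
      $= X(Z_1' - Z_2' A_{22}^{-1} A_{21})$
      $= W_1 (P_1')^{-1};$

that is, $W_1 = \hat{B}_{1\Omega} A_{11 \cdot 2} P_1' = \hat{B}_{1\Omega} P_1^{-1}$ and $\Gamma_1 = B_1 P_1^{-1}$. Similarly, from (6) we
obtain

(34)  $\hat{B}_{2\omega} A_{22} + B_1^* A_{12} = C_2 = XZ_2' = W_2 (P_2')^{-1};$

that is, $W_2 = (\hat{B}_{2\omega} A_{22} + B_1^* A_{12}) P_2' = \hat{B}_{2\omega} P_2^{-1} + B_1^* A_{12} P_2'$ and
$\Gamma_2 = B_2 P_2^{-1} + B_1 A_{12} P_2'$.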


8.4. THE DISTRIBUTION OF THE LIKELIHOOD RATIO CRITERION 
WHEN THE HYPOTHESIS IS TRUE 
8.4.1. Characterization of the Distribution 


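The likelihood ratio criterion is the $\frac{1}{2}N$th power of

(1)  $U = \lambda^{2/N} = \frac{|\hat\Sigma_\Omega|}{|\hat\Sigma_\omega|} = \frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\Omega + (\hat{B}_{1\Omega} - B_1^*) A_{11 \cdot 2} (\hat{B}_{1\Omega} - B_1^*)'|},$

where $A_{11 \cdot 2} = A_{11} - A_{12} A_{22}^{-1} A_{21}$. We shall study the distribution and the
moments of $U$ when $B_1 = B_1^*$. It has been shown in Section 8.2 that $N\hat\Sigma_\Omega$ is
distributed according to $W(\Sigma, n)$, where $n = N - q$, and the elements of
$\hat{B}_\Omega - B$ have a joint normal distribution independent of $N\hat\Sigma_\Omega$.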
From (33) of Section 8.3, we have

(2)  $(\hat{B}_{1\Omega} - B_1) A_{11 \cdot 2} (\hat{B}_{1\Omega} - B_1)' = (W_1 - \Gamma_1) P_1 A_{11 \cdot 2} P_1' (W_1 - \Gamma_1)' = (W_1 - \Gamma_1)(W_1 - \Gamma_1)',$

by (24) of Section 8.3; the columns of $W_1 - \Gamma_1$ are independently distributed,
each according to $N(0, \Sigma)$.


Lemma 8.4.1. $(\hat{B}_{1\Omega} - B_1) A_{11 \cdot 2} (\hat{B}_{1\Omega} - B_1)'$ is distributed according to
$W(\Sigma, q_1)$.


Lemma 8.4.2. The criterion $U$ has the distribution of

(3)  $U = \frac{|G|}{|G + H|},$

where $G$ is distributed according to $W(\Sigma, n)$, $H$ is distributed according to
$W(\Sigma, m)$, where $m = q_1$, and $G$ and $H$ are independent.


Let

(4)  $G = N\hat\Sigma_\Omega = XX' - XZ'(ZZ')^{-1} ZX',$

(5)  $G + H = N\hat\Sigma_\Omega + (\hat{B}_{1\Omega} - B_1^*) A_{11 \cdot 2} (\hat{B}_{1\Omega} - B_1^*)' = N\hat\Sigma_\omega = YY' - YZ_2'(Z_2 Z_2')^{-1} Z_2 Y',$

where $Y = X - B_1^* Z_1 = X - (B_1^* \ 0) Z$. Then

(6)  $G = YY' - YZ'(ZZ')^{-1} ZY'.$


We shall denote this criterion as $U_{p,m,n}$, where p is the dimensionality,
$m = q_1$ is the number of columns of $B_1$, and $n = N - q$ is the number of
degrees of freedom of $G$.

We now proceed to characterize the distribution of $U$ as the product of
beta variables (Section 5.2). Write the criterion $U$ as

(7)  $U = \prod_{i=1}^{p} V_i,$

where $V_1 = g_{11}/(g_{11} + h_{11})$,

(8)  $V_i = \frac{|G_i|}{|G_{i-1}|} \cdot \frac{|G_{i-1} + H_{i-1}|}{|G_i + H_i|}, \qquad i = 2, \ldots, p,$

and $G_i$ and $H_i$ are the submatrices of $G$ and $H$, respectively, of the first i
rows and columns. Correspondingly, let $y_\alpha^{(i)}$ consist of the first i components
of $y_\alpha = x_\alpha - B_1^* z_\alpha^{(1)}$, $\alpha = 1, \ldots, N$. We shall show that $V_i$ is the length squared
of the vector from $y_i^* = (y_{i1}, \ldots, y_{iN})$ to its projection on $Z$ and
$Y_{i-1} = (y_1^{(i-1)}, \ldots, y_N^{(i-1)})$ divided by the length squared of the vector from $y_i^*$ to its
projection on $Z_2$ and $Y_{i-1}$.
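
The factorization of U into the ratios V_i is a telescoping product of leading principal minors, which the following sketch checks numerically. The matrices G and H here are simulated cross-product matrices standing in for the Wishart matrices of the text, and `leading_det` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, m = 4, 15, 3
G_half = rng.standard_normal((p, n))
H_half = rng.standard_normal((p, m))
G = G_half @ G_half.T                   # plays the role of G
H = H_half @ H_half.T                   # plays the role of H

def leading_det(M, i):
    """Determinant of the leading i x i submatrix; defined as 1 for i = 0."""
    return np.linalg.det(M[:i, :i]) if i > 0 else 1.0

V = [
    (leading_det(G, i) / leading_det(G, i - 1))
    / (leading_det(G + H, i) / leading_det(G + H, i - 1))
    for i in range(1, p + 1)
]
U = np.linalg.det(G) / np.linalg.det(G + H)
```

The first factor reduces to the ratio of the leading diagonal elements, and the product of all p factors recovers U.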


Lemma 8.4.3. Let $y$ be an N-component row vector and $U$ an $r \times N$ matrix.
Then the sum of squares of the residuals of $y$ from its regression on $U$ is

(9)  $\frac{\begin{vmatrix} yy' & yU' \\ Uy' & UU' \end{vmatrix}}{|UU'|}.$

Proof. By Corollary A.3.1 of the Appendix, (9) is $yy' - yU'(UU')^{-1}Uy'$,
which is the sum of squares of residuals as indicated in (13) of Section 8.2.
∎
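
A quick numerical check of the lemma compares the ratio of determinants with the residual sum of squares from an ordinary least squares fit. The data are simulated and the use of `numpy.linalg.lstsq` is a choice made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
r, N = 3, 10
U = rng.standard_normal((r, N))          # rows are the regressors
y = rng.standard_normal((1, N))          # N-component row vector

bordered = np.block([[y @ y.T, y @ U.T],
                     [U @ y.T, U @ U.T]])
ratio = np.linalg.det(bordered) / np.linalg.det(U @ U.T)    # expression (9)

# Regress y on the rows of U and sum the squared residuals.
coef, *_ = np.linalg.lstsq(U.T, y.ravel(), rcond=None)
rss = np.sum((y.ravel() - coef @ U) ** 2)
```

The two quantities `ratio` and `rss` should agree up to rounding error.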


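Lemma 8.4.4. $V_i$ defined by (8) is the ratio of the sum of squares of the
residuals of $y_{i1}, \ldots, y_{iN}$ from their regression on $y_1^{(i-1)}, \ldots, y_N^{(i-1)}$ and $Z$ to the
sum of squares of residuals of $y_{i1}, \ldots, y_{iN}$ from their regression on $y_1^{(i-1)}, \ldots, y_N^{(i-1)}$
and $Z_2$.


Proof. The numerator of $V_i$ can be written [from (13) of Section 8.2]

(10)  $\frac{|G_i|}{|G_{i-1}|} = \frac{|Y_i Y_i' - Y_i Z'(ZZ')^{-1} Z Y_i'|}{|Y_{i-1} Y_{i-1}' - Y_{i-1} Z'(ZZ')^{-1} Z Y_{i-1}'|}$

      $= \frac{\begin{vmatrix} Y_i Y_i' & Y_i Z' \\ Z Y_i' & ZZ' \end{vmatrix}}{\begin{vmatrix} Y_{i-1} Y_{i-1}' & Y_{i-1} Z' \\ Z Y_{i-1}' & ZZ' \end{vmatrix}}$

      $= \frac{\begin{vmatrix} y_i^* y_i^{*\prime} & y_i^* Y_{i-1}' & y_i^* Z' \\ Y_{i-1} y_i^{*\prime} & Y_{i-1} Y_{i-1}' & Y_{i-1} Z' \\ Z y_i^{*\prime} & Z Y_{i-1}' & ZZ' \end{vmatrix}}{\begin{vmatrix} Y_{i-1} Y_{i-1}' & Y_{i-1} Z' \\ Z Y_{i-1}' & ZZ' \end{vmatrix}}$

      $= y_i^* y_i^{*\prime} - y_i^* (Y_{i-1}' \ Z') \begin{pmatrix} Y_{i-1} Y_{i-1}' & Y_{i-1} Z' \\ Z Y_{i-1}' & ZZ' \end{pmatrix}^{-1} \begin{pmatrix} Y_{i-1} \\ Z \end{pmatrix} y_i^{*\prime}$

by Corollary A.3.1. Application of Lemma 8.4.3 shows that the right-hand
side of (10) is the sum of squares of the residuals of $y_i^*$ on $Y_{i-1}$ and $Z$. The
denominator is evaluated similarly with $Z$ replaced by $Z_2$. ∎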


The ratio $V_i$ is the 2/Nth power of the likelihood ratio criterion for
testing the hypothesis that the regression of $y_i^* = x_i^* - \beta_{i1}^* Z_1$ on $Z_1$ is 0 (in
the presence of regression on $Y_{i-1}$ and $Z_2$); here $\beta_{i1}^*$ is the ith row of $B_1^*$. For
$i = 1$, $g_{11}$ is the sum of squares of the residuals of $y_1^* = (y_{11}, \ldots, y_{1N})$ from its
regression on $Z$, and $g_{11} + h_{11}$ is the sum of squares of the residuals from $Z_2$.
The ratio $V_1 = g_{11}/(g_{11} + h_{11})$, which is appropriate to test the hypothesis
that regression of $y_1^*$ on $Z_1$ is 0, is distributed as $\chi_n^2/(\chi_n^2 + \chi_m^2)$ (by Lemma
8.4.2) and has the beta distribution $\beta(v; \frac{1}{2}n, \frac{1}{2}m)$. (See Section 5.2, for
example.) Thus $V_i$ has the beta density

(11)  $\beta[v; \tfrac{1}{2}(n+1-i), \tfrac{1}{2}m] = \frac{\Gamma[\frac{1}{2}(n+m+1-i)]}{\Gamma[\frac{1}{2}(n+1-i)]\, \Gamma(\frac{1}{2}m)}\, v^{\frac{1}{2}(n+1-i)-1} (1 - v)^{\frac{1}{2}m-1}$

for $0 < v < 1$ and 0 for v outside this interval. Since this distribution does not
depend on $Y_{i-1}$, we see that the ratio $V_i$ is independent of $Y_{i-1}$, and hence
independent of $V_1, \ldots, V_{i-1}$. Then $V_1, \ldots, V_p$ are independent.

Theorem 8.4.1. The distribution of $U$ defined by (3) is the distribution of the
product $\prod_{i=1}^{p} V_i$, where $V_1, \ldots, V_p$ are independent and $V_i$ has the density (11).


The cdf of $U$ can be found by integrating the joint density of $V_1, \ldots, V_p$
over the range

(12)  $$\prod_{i=1}^{p} v_i \le u.$$

We shall now show that for given $N - q_2$, the indices $p$ and $q_1$ can be
interchanged; that is, the distributions of $U_{p,q_1,N-q_2-q_1} = U_{p,m,n}$ and of
$U_{q_1,p,N-q_2-p} = U_{m,p,n+m-p}$ are the same. The joint density of $G$ and $W_1$
defined in Section 8.2 when $\Sigma = I$ and $\beta_1 = 0$ is

(13)  $$\frac{|G|^{\frac12(n-p-1)}\,e^{-\frac12\operatorname{tr}G}}
            {2^{\frac12 pn}\,\pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[\tfrac12(n+1-i)]}
      \cdot\frac{e^{-\frac12\operatorname{tr}W_1W_1'}}{(2\pi)^{\frac12 pm}}.$$


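Let $G + W_1W_1' = J = CC'$ and let $W_1 = CU$. Then

(14)  $$\begin{aligned}
U_{p,m,n} &= \frac{|G|}{|G + W_1W_1'|} = \frac{|CC' - CUU'C'|}{|CC'|} = |I_p - UU'| \\
&= \begin{vmatrix} I_p & U \\ U' & I_m \end{vmatrix}
 = \begin{vmatrix} I_m & U' \\ U & I_p \end{vmatrix}
 = |I_m - U'U|;
\end{aligned}$$

the fourth and sixth equalities follow from Theorem A.3.2 of the Appendix,
and the fifth from permutation of rows and columns. Since the Jacobian of
$W_1 = CU$ is $\operatorname{mod}|C|^m = |J|^{\frac12 m}$, the joint density of $J$ and $U$ is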


(15)  $$\frac{|J|^{\frac12(n+m-p-1)}\,e^{-\frac12\operatorname{tr}J}}
            {2^{\frac12 p(n+m)}\,\pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[\tfrac12(n+m+1-i)]}
      \cdot\prod_{i=1}^{p}\frac{\Gamma[\tfrac12(n+m+1-i)]}{\Gamma[\tfrac12(n+1-i)]}\,
      \frac{|I_p - UU'|^{\frac12(n-p-1)}}{\pi^{\frac12 pm}}$$

for $J$ and $I_p - UU'$ positive definite, and 0 otherwise. Thus $J$ and $U$ are
independently distributed; the density of $J$ is the first term in (15), namely,
$w(J \mid I_p, n+m)$, and the density of $U$ is the second term, namely, of the form

(16)  $$K\,|I_p - UU'|^{\frac12(n-p-1)}$$

for $I_p - UU'$ positive definite, and 0 otherwise. Let $U_* = U'$, $p^* = m$, $m^* = p$,
and $n^* = n + m - p$. Then the density of $U_*$ is


(17)  $$K\,|I_p - U_*'U_*|^{\frac12(n-p-1)}$$

for $I_p - U_*'U_*$ positive definite, and 0 otherwise. By (14), $|I_p - U_*'U_*| =
|I_m - U_*U_*'|$, and hence the density of $U_*$ is

(18)  $$K\,|I_m - U_*U_*'|^{\frac12(n^*-p^*-1)},$$

which is of the form of (16) with $p$ replaced by $p^* = m$, $m$ replaced by
$m^* = p$, and $n - p - 1$ replaced by $n^* - p^* - 1 = n - p - 1$. Finally we note
that $U_{p,m,n}$ given by (14) is $|I_m - U_*U_*'| = U_{m,p,n+m-p}$.


Theorem 8.4.2. When the hypothesis is true, the distribution of $U_{p,q_1,N-q_1-q_2}$
is the same as that of $U_{q_1,p,N-p-q_2}$ (i.e., that of $U_{p,m,n}$ is that of $U_{m,p,n+m-p}$).

8.4.2. Moments 


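Since (11) is a density and hence integrates to 1, by change of notation

(19)  $$\int_0^1 v^{a-1}(1-v)^{b-1}\,dv = B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

From this fact we see that the hth moment of $V_i$ is

(20)  $$\begin{aligned}
\mathcal{E}V_i^h &= \int_0^1 \frac{\Gamma[\tfrac12(n+m+1-i)]}{\Gamma[\tfrac12(n+1-i)]\,\Gamma(\tfrac12 m)}\,
   v^{\frac12(n+1-i)+h-1}(1-v)^{\frac12 m-1}\,dv \\
&= \frac{\Gamma[\tfrac12(n+1-i)+h]\,\Gamma[\tfrac12(n+m+1-i)]}
        {\Gamma[\tfrac12(n+1-i)]\,\Gamma[\tfrac12(n+m+1-i)+h]}.
\end{aligned}$$

Since $V_1, \ldots, V_p$ are independent, $\mathcal{E}U^h = \mathcal{E}\prod_{i=1}^{p}V_i^h = \prod_{i=1}^{p}\mathcal{E}V_i^h$. We obtain
the following theorem:

Theorem 8.4.3. The hth moment of $U$ [if $h > -\tfrac12(n+1-p)$] is

(21)  $$\begin{aligned}
\mathcal{E}U^h &= \prod_{i=1}^{p}\frac{\Gamma[\tfrac12(n+1-i)+h]\,\Gamma[\tfrac12(n+m+1-i)]}
                                       {\Gamma[\tfrac12(n+1-i)]\,\Gamma[\tfrac12(n+m+1-i)+h]} \\
&= \prod_{i=1}^{p}\frac{\Gamma[\tfrac12(N-q+1-i)+h]\,\Gamma[\tfrac12(N-q_2+1-i)]}
                        {\Gamma[\tfrac12(N-q+1-i)]\,\Gamma[\tfrac12(N-q_2+1-i)+h]}.
\end{aligned}$$

In the first expression $p$ can be replaced by $m$, $m$ by $p$, and $n$ by
$n + m - p$.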
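As an illustration of Theorem 8.4.3, the moment formula (21) can be evaluated on the log scale with the standard-library function `math.lgamma`. This is a minimal sketch; the function name `u_moment` is an assumption made for the example.

```python
from math import lgamma, exp

def u_moment(h, p, m, n):
    """h-th moment of U_{p,m,n} from (21); requires h > -(n + 1 - p)/2."""
    if h <= -(n + 1 - p) / 2:
        raise ValueError("moment does not exist for this h")
    log_mom = 0.0
    for i in range(1, p + 1):
        a = (n + 1 - i) / 2          # first beta parameter of V_i
        b = (n + m + 1 - i) / 2      # sum of the two beta parameters of V_i
        log_mom += lgamma(a + h) + lgamma(b) - lgamma(a) - lgamma(b + h)
    return exp(log_mom)
```

Evaluating `u_moment(h, p, m, n)` and `u_moment(h, m, p, n + m - p)` for the same `h` gives a quick numerical check of the interchange of $p$ and $m$ stated in Theorem 8.4.2.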
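Suppose $p$ is even, that is, $p = 2r$. We use the duplication formula

(22)  $$\Gamma(\alpha + \tfrac12)\,\Gamma(\alpha + 1) = \frac{\sqrt{\pi}\,\Gamma(2\alpha + 1)}{2^{2\alpha}}.$$

Then the hth moment of $U_{2r,m,n}$ is

(23)  $$\begin{aligned}
\mathcal{E}U^h_{2r,m,n} &= \prod_{j=1}^{r}\left\{
  \frac{\Gamma[\tfrac12(m+n+2)-j]\,\Gamma[\tfrac12(m+n+1)-j]}
       {\Gamma[\tfrac12(m+n+2)-j+h]\,\Gamma[\tfrac12(m+n+1)-j+h]}
  \cdot\frac{\Gamma[\tfrac12(n+2)-j+h]\,\Gamma[\tfrac12(n+1)-j+h]}
            {\Gamma[\tfrac12(n+2)-j]\,\Gamma[\tfrac12(n+1)-j]}\right\} \\
&= \prod_{j=1}^{r}\frac{\Gamma(m+n+1-2j)\,\Gamma(n+1-2j+2h)}{\Gamma(m+n+1-2j+2h)\,\Gamma(n+1-2j)}.
\end{aligned}$$

It is clear from the definition of the beta function that (23) is

(24)  $$\prod_{j=1}^{r}\int_0^1 \frac{\Gamma(m+n+1-2j)}{\Gamma(n+1-2j)\,\Gamma(m)}\,
      y^{(n+1-2j)+2h-1}(1-y)^{m-1}\,dy
      = \mathcal{E}\prod_{j=1}^{r}Y_j^{2h} = \mathcal{E}\Bigl(\prod_{j=1}^{r}Y_j^2\Bigr)^h,$$

where the $Y_j$ are independent and $Y_j$ has density $\beta(y; n+1-2j, m)$.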
Suppose $p$ is odd; that is, $p = 2s + 1$. Then

(25)  $$\mathcal{E}U^h_{2s+1,m,n} = \mathcal{E}\Bigl(\prod_{i=1}^{s}Z_i^2\,Z_{s+1}\Bigr)^h,$$

where the $Z_i$ are independent and $Z_i$ has density $\beta(z; n+1-2i, m)$ for
$i = 1, \ldots, s$ and $Z_{s+1}$ is distributed with density $\beta[z; (n+1-p)/2, m/2]$.


Theorem 8.4.4. $U_{2r,m,n}$ is distributed as $\prod_{i=1}^{r}Y_i^2$, where $Y_1, \ldots, Y_r$ are
independent and $Y_i$ has density $\beta(y; n+1-2i, m)$; $U_{2s+1,m,n}$ is distributed as
$\prod_{i=1}^{s}Z_i^2\,Z_{s+1}$, where the $Z_i$, $i = 1, \ldots, s$, are independent and $Z_i$ has density
$\beta(z; n+1-2i, m)$, and $Z_{s+1}$ is independently distributed with density
$\beta[z; \tfrac12(n+1-p), \tfrac12 m]$.
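The two representations of $U$, as a product of the $V_i$ and as in Theorem 8.4.4, must have the same moments. The sketch below compares them numerically; the function names are assumptions chosen for the example, and only the standard beta moment $\Gamma(a+s)\Gamma(a+b)/[\Gamma(a)\Gamma(a+b+s)]$ is used.

```python
from math import lgamma, exp

def beta_moment(s, a, b):
    """E Y^s for Y with density beta(y; a, b)."""
    return exp(lgamma(a + s) + lgamma(a + b) - lgamma(a) - lgamma(a + b + s))

def u_moment_from_v(h, p, m, n):
    """E U^h as a product over V_i with density beta[v; (n+1-i)/2, m/2]."""
    out = 1.0
    for i in range(1, p + 1):
        out *= beta_moment(h, (n + 1 - i) / 2, m / 2)
    return out

def u_moment_from_theorem_844(h, p, m, n):
    """E U^h from the representation in Theorem 8.4.4 (p even or odd)."""
    out = 1.0
    for i in range(1, p // 2 + 1):
        out *= beta_moment(2 * h, n + 1 - 2 * i, m)     # E (Y_i^2)^h
    if p % 2 == 1:
        out *= beta_moment(h, (n + 1 - p) / 2, m / 2)   # the extra factor Z_{s+1}
    return out
```

Agreement of the two functions for several values of `h` is a numerical consequence of the duplication formula (22), not an independent proof.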


8.4.3. Some Special Distributions 


p = 1
From the preceding characterization we see that the density of $U_{1,m,n}$ is

(26)  $$\frac{\Gamma[\tfrac12(n+m)]}{\Gamma(\tfrac12 n)\,\Gamma(\tfrac12 m)}\,u^{\frac12 n-1}(1-u)^{\frac12 m-1}.$$

Another way of writing $U_{1,m,n}$ is

(27)  $$U_{1,m,n} = \frac{1}{1 + \sum_{i=1}^{m}Y_i^2/g_{11}} = \frac{1}{1 + (m/n)F_{m,n}},$$

where $g_{11}$ is the one element of $G = N\hat\Sigma_\Omega$ and $F_{m,n}$ is an F-statistic. Thus

(28)  $$\frac{1 - U_{1,m,n}}{U_{1,m,n}}\cdot\frac{n}{m} = F_{m,n}.$$

Theorem 8.4.5. The distribution of $[(1 - U_{1,m,n})/U_{1,m,n}]\cdot n/m$ is the
F-distribution with $m$ and $n$ degrees of freedom; the distribution of
$[(1 - U_{p,1,n})/U_{p,1,n}]\cdot(n+1-p)/p$ is the F-distribution with $p$ and $n+1-p$
degrees of freedom.


p = 2
From Theorem 8.4.4, we see that the density of $\sqrt{U_{2,m,n}}$ is

(29)  $$\frac{\Gamma(n+m-1)}{\Gamma(n-1)\,\Gamma(m)}\,x^{n-2}(1-x)^{m-1},$$

and thus the density of $U_{2,m,n}$ is

(30)  $$\frac{\Gamma(n+m-1)}{2\,\Gamma(n-1)\,\Gamma(m)}\,u^{\frac12(n-3)}(1-\sqrt{u})^{m-1}.$$

From (29) it follows that

(31)  $$\frac{1 - \sqrt{U_{2,m,n}}}{\sqrt{U_{2,m,n}}}\cdot\frac{n-1}{m} = F_{2m,2(n-1)}.$$

Theorem 8.4.6. The distribution of $[(1 - \sqrt{U_{2,m,n}})/\sqrt{U_{2,m,n}}]\cdot(n-1)/m$
is the F-distribution with $2m$ and $2(n-1)$ degrees of freedom; the distribution
of $[(1 - \sqrt{U_{p,2,n}})/\sqrt{U_{p,2,n}}]\cdot(n+1-p)/p$ is the F-distribution with $2p$ and
$2(n+1-p)$ degrees of freedom.
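For $p = 1$ and $p = 2$ the exact F-transformations of Theorems 8.4.5 and 8.4.6 are simple enough to code directly. This is a minimal sketch; the function names and the returned tuple layout are assumptions for illustration, and the resulting statistic would be referred to an F table or an F cdf routine supplied separately.

```python
from math import sqrt

def f_from_u_p1(u, m, n):
    """Theorem 8.4.5 with p = 1: returns (F statistic, df1, df2)."""
    return (1 - u) / u * n / m, m, n

def f_from_u_p2(u, m, n):
    """Theorem 8.4.6 with p = 2: returns (F statistic, df1, df2)."""
    root = sqrt(u)
    return (1 - root) / root * (n - 1) / m, 2 * m, 2 * (n - 1)
```

By Theorem 8.4.2 the cases $m = 1$ and $m = 2$ can be handled with the same functions after replacing $(p, m, n)$ by $(m, p, n + m - p)$.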


p Even

Wald and Brookner (1941) gave a method for finding the distribution of
$U_{p,m,n}$ for $p$ or $m$ even. We shall present the method of Schatzoff (1966a). It
will be convenient first to consider $U_{p,m,n}$ for $m = 2r$. We can write the event
$\prod_{i=1}^{p}V_i \le u$ as

(32)  $$Y_1 + \cdots + Y_p \ge -\log u,$$

where $Y_1, \ldots, Y_p$ are independent and $Y_i = -\log V_i$ has the density

(33)  $$K_i\,e^{-\frac12(n+1-i)y}(1 - e^{-y})^{r-1}
      = K_i\sum_{j=0}^{r-1}(-1)^j\binom{r-1}{j}e^{-[\frac12(n+1-i)+j]y}$$

for $0 \le y < \infty$ and 0 otherwise, and

(34)  $$K_i = \frac{\Gamma[\tfrac12(n+1-i)+r]}{\Gamma[\tfrac12(n+1-i)]\,\Gamma(r)}
          = \frac{1}{(r-1)!}\prod_{j=0}^{r-1}\frac{n+1-i+2j}{2}.$$


The joint density of $Y_1, \ldots, Y_p$ is then a linear combination of terms
$\exp[-\sum_{i=1}^{p}a_iy_i]$. The density of $W_j = \sum_{i=1}^{j}Y_i$ can be obtained inductively from
the density of $W_{j-1} = \sum_{i=1}^{j-1}Y_i$ and $Y_j$, $j = 2, \ldots, p$, which is a linear combination
of terms $w_{j-1}^k e^{-cw_{j-1} - a_jy_j}$. The density of $W_j$ consists of linear combinations of

(35)  $$\int_0^{w_j} v^k e^{-cv - a_j(w_j - v)}\,dv
      = e^{-a_jw_j}\int_0^{w_j} v^k e^{(a_j - c)v}\,dv
      = \begin{cases}
        e^{-a_jw_j}\,\dfrac{w_j^{k+1}}{k+1}, & a_j = c, \\[2ex]
        e^{-cw_j}\displaystyle\sum_{h=0}^{k}(-1)^h\frac{k!}{(k-h)!}\,\frac{w_j^{k-h}}{(a_j - c)^{h+1}}
        + (-1)^{k+1}\frac{k!\,e^{-a_jw_j}}{(a_j - c)^{k+1}}, & a_j \ne c.
        \end{cases}$$

The evaluation involves integration by parts.

Theorem 8.4.7. If $p$ is even or if $m$ is even, the density of $U_{p,m,n}$ can be
expressed as a linear combination of terms $(-\log u)^k u^l$, where $k$ is an integer and
$l$ is a half integer.
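The building block of the induction is the integral $\int_0^w v^k e^{bv}\,dv$ with $b = a_j - c$, evaluated by repeated integration by parts. The sketch below compares that closed form with a composite Simpson rule; the function names and the test values of $k$, $b$, $w$ are illustrative assumptions.

```python
from math import exp, factorial

def closed_form(k, b, w):
    """Integral of v^k e^{b v} over [0, w], from repeated integration by parts."""
    if b == 0:
        return w ** (k + 1) / (k + 1)
    total = 0.0
    for h in range(k + 1):
        total += (-1) ** h * factorial(k) / factorial(k - h) * w ** (k - h) / b ** (h + 1)
    # Boundary term at v = 0 comes only from h = k.
    return exp(b * w) * total + (-1) ** (k + 1) * factorial(k) / b ** (k + 1)

def simpson(f, lo, hi, steps=2000):
    """Composite Simpson rule; steps must be even."""
    width = (hi - lo) / steps
    acc = f(lo) + f(hi)
    for i in range(1, steps):
        acc += (4 if i % 2 else 2) * f(lo + i * width)
    return acc * width / 3
```

Multiplying `closed_form(k, a_j - c, w)` by `exp(-a_j * w)` gives the corresponding term of the density of $W_j$; for $b$ very close to 0 the closed form suffers from cancellation and the $b = 0$ branch is the safer choice.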


From (35) we see that the cumulative distribution function of $-\log U$ is a
linear combination of terms $w^k e^{-lw}$ and hence the cumulative distribution
function of $U$ is a linear combination of terms $(-\log u)^k u^l$. The values of $k$
and $l$ and the coefficients depend on $p$, $m$, and $n$. They can be obtained by
inductively carrying out the procedure leading to Theorem 8.4.7. Pillai and
Gupta (1969) used Theorem 8.4.3 for obtaining distributions.

An alternative approach is to use Theorem 8.4.4. The complement to the
cumulative distribution function of $U_{2r,m,n}$ is

(36)  $$\begin{aligned}
\Pr\{U_{2r,m,n} \ge u\} &= \Pr\Bigl\{\prod_{i=1}^{r}Y_i \ge \sqrt{u}\Bigr\} \\
&= \int\cdots\int_{\prod_{i=1}^{r}y_i \ge \sqrt{u}}\ \prod_{i=1}^{r}\beta(y_i; n+1-2i, m)\,dy_1\cdots dy_r.
\end{aligned}$$

In the density, $(1 - y_i)^{m-1}$ can be expanded by the binomial theorem. Then
all integrations are expressed as integrations of powers of the variables.
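As an example, consider $r = 2$. The density of $Y_1$ and $Y_2$ is

(37)  $$c\,y_1^{n-2}(1-y_1)^{m-1}\,y_2^{n-4}(1-y_2)^{m-1}
      = c\sum_{i,j=0}^{m-1}\frac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!}\,y_1^{n-2+i}\,y_2^{n-4+j},$$

where

(38)  $$c = \frac{\Gamma(n+m-1)\,\Gamma(n+m-3)}{\Gamma(n-1)\,\Gamma(n-3)\,\Gamma^2(m)}.$$

The complement to the cdf of $U_{4,m,n}$ is

(39)  $$\begin{aligned}
\Pr\{U_{4,m,n} \ge u\}
&= c\sum_{i,j=0}^{m-1}\frac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!}
   \int_{\sqrt{u}}^{1}\int_{\sqrt{u}/y_1}^{1} y_1^{n-2+i}\,y_2^{n-4+j}\,dy_2\,dy_1 \\
&= c\sum_{i,j=0}^{m-1}\frac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!\,(n-3+j)}
   \int_{\sqrt{u}}^{1} y_1^{n-2+i}\Bigl[1 - \Bigl(\frac{\sqrt{u}}{y_1}\Bigr)^{n-3+j}\Bigr]dy_1.
\end{aligned}$$

The last step of the integration yields powers of $\sqrt{u}$ and products of powers
of $\sqrt{u}$ and $\log u$ (for $1 + i - j = -1$).

Particular Values

Wilks (1935) gives explicitly the distributions of $U$ for $p = 1$, $p = 2$, $p = 3$
with $m = 3$; $p = 3$ with $m = 4$; and $p = 4$ with $m = 4$. Wilks's formula for
$p = 3$ with $m = 4$ appears to be incorrect; see the first edition of this book.
Consul (1966) gives many distributions for special cases. See also Mathai
(1971).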


8.4.4. The Likelihood Ratio Procedure 


Let $u_{p,m,n}(\alpha)$ be the $\alpha$ significance point for $U_{p,m,n}$; that is,

(40)  $$\Pr\{U_{p,m,n} \le u_{p,m,n}(\alpha) \mid H \text{ true}\} = \alpha.$$

It is shown in Section 8.5 that $-[n - \tfrac12(p - m + 1)]\log U_{p,m,n}$ has a limiting
$\chi^2$-distribution with $pm$ degrees of freedom. Let $\chi^2_{pm}(\alpha)$ denote the $\alpha$
significance point of $\chi^2_{pm}$, and let

(41)  $$C_{p,m,n-p+1}(\alpha) = \frac{-[n - \tfrac12(p - m + 1)]\log u_{p,m,n}(\alpha)}{\chi^2_{pm}(\alpha)}.$$

Table B.1 [from Pearson and Hartley (1972)] gives values of $C_{p,m,M}(\alpha)$ for
$\alpha = 0.10$ and $0.05$, $p = 1(1)10$, various even values of $m$, and $M = n - p + 1
= 1(1)10(2)20, 24, 30, 40, 60, 120$.

To test a null hypothesis one computes $U_{p,m,n}$ and rejects the null
hypothesis at significance level $\alpha$ if

(42)  $$-[n - \tfrac12(p - m + 1)]\log U_{p,m,n} > C_{p,m,n-p+1}(\alpha)\,\chi^2_{pm}(\alpha).$$

Since $C_{p,m,n}(\alpha) > 1$, the hypothesis is accepted if the left-hand side of (42) is
less than $\chi^2_{pm}(\alpha)$.

The purpose of tabulating $C_{p,m,M}(\alpha)$ is that linear interpolation is reasonably
accurate because the entries decrease monotonically and smoothly to 1
as $M$ increases. Schatzoff (1966a) has recommended interpolation for odd $p$
by using adjacent even values of $p$ and displays some examples. The table
also indicates how accurate the $\chi^2$-approximation is. The table has been
extended by Pillai and Gupta (1969).
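The decision rule (42) can be sketched in a few lines. The function names are assumptions made for this example; the $\chi^2$ significance point and the correction factor $C$ are not computed here and must be supplied by the user (for instance from a $\chi^2$ table and Table B.1).

```python
from math import log

def bartlett_statistic(u, p, m, n):
    """-[n - (p - m + 1)/2] log U, the left-hand side of (42)."""
    return -(n - (p - m + 1) / 2) * log(u)

def likelihood_ratio_test(u, p, m, n, chi2_point, c_factor=1.0):
    """Return (statistic, reject) for the rule statistic > C * chi-square point.

    chi2_point : caller-supplied alpha significance point of chi-square with pm d.f.
    c_factor   : caller-supplied table factor C; 1.0 gives the plain
                 chi-square approximation.
    """
    stat = bartlett_statistic(u, p, m, n)
    return stat, stat > c_factor * chi2_point
```

Because $C > 1$, a statistic below the plain $\chi^2$ point already leads to acceptance, so the table factor only matters for values of the statistic near the critical region.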


8.4.5. A Step-down Procedure 


The criterion $U$ has been expressed in (7) as the product of independent beta
variables $V_1, V_2, \ldots, V_p$. The ratio $V_i$ is a least squares criterion for testing the
null hypothesis that in the regression of $x_i^* - \beta^*_{i1}Z_1$ on $Z = (Z_1'\ Z_2')'$ and
$X_{i-1}$ the coefficient of $Z_1$ is 0. The null hypothesis that the regression of $X$
on $Z_1$ is $\beta_1^*$, which is equivalent to the hypothesis that the regression of
$X - \beta_1^*Z_1$ on $Z_1$ is 0, is composed of the hypotheses that the regression
of $x_i^* - \beta^*_{i1}Z_1$ on $Z_1$ is 0, $i = 1, \ldots, p$. Hence the null hypothesis $\beta_1 = \beta_1^*$ can
be tested by use of $V_1, \ldots, V_p$.

Since $V_i$ has the beta density (11) under the hypothesis $\beta_{i1} = \beta^*_{i1}$,

(43)  $$\frac{1 - V_i}{V_i}\cdot\frac{n - i + 1}{m}$$

has the F-distribution with $m$ and $n - i + 1$ degrees of freedom. The step-down
testing procedure is to compare (43) for $i = 1$ with the significance
point $F_{m,n}(\varepsilon_1)$; if (43) for $i = 1$ is larger, reject the null hypothesis that the
regression of $x_1^* - \beta^*_{11}Z_1$ on $Z_1$ is 0 and hence reject the null hypothesis that
$\beta_1 = \beta_1^*$. If this first component null hypothesis is accepted, compare (43) for
$i = 2$ with $F_{m,n-1}(\varepsilon_2)$. In sequence, the component null hypotheses are
tested. If one is rejected, the sequence is stopped and the hypothesis $\beta_1 = \beta_1^*$
is rejected. If all component null hypotheses are accepted, the composite
hypothesis is accepted. When the hypothesis $\beta_1 = \beta_1^*$ is true, the probability
of accepting it is $\prod_{i=1}^{p}(1 - \varepsilon_i)$. Hence the significance level of the step-down
test is $1 - \prod_{i=1}^{p}(1 - \varepsilon_i)$.
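The sequential logic of the step-down test, and the relation between the component levels $\varepsilon_i$ and the overall level, can be sketched as follows. The function names are assumptions for the example, and the F significance points are taken as caller-supplied inputs rather than computed.

```python
def step_down_test(v, m, n, f_points):
    """Sequential step-down test.

    v        : observed V_1, ..., V_p
    f_points : caller-supplied points F_{m, n-i+1}(eps_i), i = 1, ..., p
    Returns (reject, index of the first rejected component or None).
    """
    for i, (v_i, f_crit) in enumerate(zip(v, f_points), start=1):
        stat = (1 - v_i) / v_i * (n - i + 1) / m   # statistic (43)
        if stat > f_crit:
            return True, i                          # stop at the first rejection
    return False, None

def overall_level(eps):
    """Significance level 1 - prod(1 - eps_i) of the composite test."""
    keep = 1.0
    for e in eps:
        keep *= 1 - e
    return 1 - keep

def equal_component_level(alpha, p):
    """Common eps giving overall level alpha when all eps_i are equal."""
    return 1 - (1 - alpha) ** (1 / p)
```

For example, five component tests each at level 0.01 give an overall level of $1 - 0.99^5$, slightly below 0.05.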

In the step-down procedure the investigator usually has a choice of the
ordering of the variables† (i.e., the numbering of the components of $X$) and a
selection of component significance levels. It seems reasonable to order the
variables in descending order of importance. The choice of significance levels
will affect the power. If $\varepsilon_i$ is a very small number, it will take a correspondingly
large deviation from the ith null hypothesis to lead to rejection. In the
absence of any other reason, the component significance levels can be taken
equal. This procedure, of course, is not invariant with respect to linear
transformation of the dependent vector variable. However, before carrying
out a step-down procedure, a linear transformation can be used to determine
the $p$ variables.

The factors can be grouped. For example, group $x_1, \ldots, x_k$ into one set
and $x_{k+1}, \ldots, x_p$ into another set. Then $U_{k,m,n} = \prod_{i=1}^{k}V_i$ can be used to test
the null hypothesis that the first $k$ rows of $\beta_1$ are the first $k$ rows of $\beta_1^*$.
Subsequently $\prod_{i=k+1}^{p}V_i$ is used to test the hypothesis that the last $p - k$ rows
of $\beta_1$ are those of $\beta_1^*$; this latter criterion has the distribution under the null
hypothesis of $U_{p-k,m,n-k}$.


†In some cases the ordering of variables may be imposed; for example, $x_1$ might be an
observation at the first time point, $x_2$ at the second time point, and so on.

The investigator may test the null hypothesis $\beta_1 = \beta_1^*$ by the likelihood
ratio procedure. If the hypothesis is rejected, he may look at the factors
$V_1, \ldots, V_p$ to try to determine which rows of $\beta_1$ might be different from $\beta_1^*$.


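The factors can also be used to obtain confidence regions for $\beta_{11}, \ldots, \beta_{p1}$.
Let $v_i(\varepsilon_i)$ be defined by

(44)  $$\frac{1 - v_i(\varepsilon_i)}{v_i(\varepsilon_i)}\cdot\frac{n - i + 1}{m} = F_{m,n-i+1}(\varepsilon_i).$$

Then a confidence region for $\beta_{i1}$ of confidence $1 - \varepsilon_i$ is

(45)  $$\frac{\begin{vmatrix}
        x_i^*x_i^{*\prime} & x_i^*X_{i-1}' & x_i^*Z' \\
        X_{i-1}x_i^{*\prime} & X_{i-1}X_{i-1}' & X_{i-1}Z' \\
        Zx_i^{*\prime} & ZX_{i-1}' & ZZ' \end{vmatrix}}
      {\begin{vmatrix}
        (x_i^* - \beta_{i1}Z_1)(x_i^* - \beta_{i1}Z_1)' & (x_i^* - \beta_{i1}Z_1)X_{i-1}' & (x_i^* - \beta_{i1}Z_1)Z_2' \\
        X_{i-1}(x_i^* - \beta_{i1}Z_1)' & X_{i-1}X_{i-1}' & X_{i-1}Z_2' \\
        Z_2(x_i^* - \beta_{i1}Z_1)' & Z_2X_{i-1}' & Z_2Z_2' \end{vmatrix}}
      \cdot
      \frac{\begin{vmatrix} X_{i-1}X_{i-1}' & X_{i-1}Z_2' \\ Z_2X_{i-1}' & Z_2Z_2' \end{vmatrix}}
           {\begin{vmatrix} X_{i-1}X_{i-1}' & X_{i-1}Z' \\ ZX_{i-1}' & ZZ' \end{vmatrix}}
      \ge v_i(\varepsilon_i).$$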


8.5. AN ASYMPTOTIC EXPANSION OF THE DISTRIBUTION 
OF THE LIKELIHOOD RATIO CRITERION 


8.5.1. General Theory of Asymptotic Expansions 


In this section we develop a large-sample distribution theory for the criterion
studied in this chapter. First we develop a general asymptotic expansion of
the distribution of a random variable whose moments are certain functions of
gamma functions [Box (1949)]. Then we apply it to the case of the likelihood
ratio criterion for the linear hypothesis.


We consider a random variable $W$ ($0 \le W \le 1$) with hth moment†

(1)  $$\mathcal{E}W^h = K\left(\frac{\prod_{j=1}^{b}y_j^{y_j}}{\prod_{k=1}^{a}x_k^{x_k}}\right)^{h}
     \frac{\prod_{k=1}^{a}\Gamma[x_k(1+h) + \xi_k]}{\prod_{j=1}^{b}\Gamma[y_j(1+h) + \eta_j]},
     \qquad h = 0, 1, \ldots,$$

†In all cases where we apply this result, the parameters $x_k$, $\xi_k$, $y_j$, and $\eta_j$ will be such that there
is a distribution with such moments.

where $K$ is a constant such that $\mathcal{E}W^0 = 1$ and

(2)  $$\sum_{k=1}^{a}x_k = \sum_{j=1}^{b}y_j.$$

It will be observed that the hth moment of $\lambda = U^{\frac12 N}_{p,m,n}$ is of this form
where $x_k = \tfrac12 N = y_j$, $\xi_k = \tfrac12(-q + 1 - k)$, $\eta_j = \tfrac12(-q_2 + 1 - j)$, $a = b = p$. We
treat a more general case here because applications later in this book require
it.

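If we let

(3)  $$M = -2\log W,$$

the characteristic function of $\rho M$ ($0 \le \rho < 1$) is

(4)  $$\begin{aligned}
\phi(t) &= \mathcal{E}e^{it\rho M} = \mathcal{E}W^{-2it\rho} \\
&= K\left(\frac{\prod_{j=1}^{b}y_j^{y_j}}{\prod_{k=1}^{a}x_k^{x_k}}\right)^{-2it\rho}
   \frac{\prod_{k=1}^{a}\Gamma[x_k(1 - 2it\rho) + \xi_k]}{\prod_{j=1}^{b}\Gamma[y_j(1 - 2it\rho) + \eta_j]}.
\end{aligned}$$

Here $\rho$ is arbitrary; later it will depend on $N$. If $a = b$, $x_k = y_k$, $\xi_k \le \eta_k$, then
(1) is the hth moment of the product of powers of variables with beta
distributions, and then (1) holds for all $h$ for which the gamma functions
exist. In this case (4) is valid for all real $t$. We shall assume here that (4) holds
for all real $t$, and in each case where we apply the result we shall verify this
assumption.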


Let

(5)  $$\Phi(t) = \log\phi(t) = g(t) - g(0),$$

where

$$\begin{aligned}
g(t) = {}& 2it\rho\Bigl[\sum_{k=1}^{a}x_k\log x_k - \sum_{j=1}^{b}y_j\log y_j\Bigr]
 + \sum_{k=1}^{a}\log\Gamma[\rho x_k(1 - 2it) + \beta_k + \xi_k] \\
& - \sum_{j=1}^{b}\log\Gamma[\rho y_j(1 - 2it) + \varepsilon_j + \eta_j],
\end{aligned}$$

where $\beta_k = (1 - \rho)x_k$ and $\varepsilon_j = (1 - \rho)y_j$. The form $g(t) - g(0)$ makes $\Phi(0) = 0$,
which agrees with the fact that $K$ is such that $\phi(0) = 1$. We make use of an
expansion formula for the gamma function [Barnes (1899), p. 64] which is
asymptotic in $x$ for bounded $h$:

(6)  $$\log\Gamma(x + h) = \log\sqrt{2\pi} + (x + h - \tfrac12)\log x - x
     - \sum_{r=1}^{m}(-1)^r\frac{B_{r+1}(h)}{r(r+1)x^r} + R_{m+1}(x),$$

where† $R_{m+1}(x) = O(x^{-(m+1)})$ and $B_r(h)$ is the Bernoulli polynomial of
degree $r$ and order unity defined by‡

(7)  $$\frac{\tau e^{h\tau}}{e^{\tau} - 1} = \sum_{r=0}^{\infty}\frac{\tau^r}{r!}B_r(h).$$

The first three polynomials are [$B_0(h) = 1$]

(8)  $$\begin{aligned}
B_1(h) &= h - \tfrac12, \\
B_2(h) &= h^2 - h + \tfrac16, \\
B_3(h) &= h^3 - \tfrac32 h^2 + \tfrac12 h.
\end{aligned}$$

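Taking $x = \rho x_k(1 - 2it)$, $\rho y_j(1 - 2it)$ and $h = \beta_k + \xi_k$, $\varepsilon_j + \eta_j$ in turn, we
obtain

(9)  $$\Phi(t) = Q - g(0) - \tfrac12 f\log(1 - 2it) + \sum_{r=1}^{m}\omega_r(1 - 2it)^{-r}
     + \sum_{k=1}^{a}O(x_k^{-(m+1)}) + \sum_{j=1}^{b}O(y_j^{-(m+1)}),$$

where

(10)  $$f = -2\Bigl[\sum_{k}\xi_k - \sum_{j}\eta_j - \tfrac12(a - b)\Bigr],$$

(11)  $$\omega_r = \frac{(-1)^{r+1}}{r(r+1)}\Bigl\{\sum_{k}\frac{B_{r+1}(\beta_k + \xi_k)}{(\rho x_k)^r}
      - \sum_{j}\frac{B_{r+1}(\varepsilon_j + \eta_j)}{(\rho y_j)^r}\Bigr\},$$

(12)  $$Q = \tfrac12(a - b)\log 2\pi - \tfrac12 f\log\rho
      + \sum_{k}(x_k + \xi_k - \tfrac12)\log x_k - \sum_{j}(y_j + \eta_j - \tfrac12)\log y_j.$$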

†$R_{m+1}(x) = O(x^{-(m+1)})$ means $|x^{m+1}R_{m+1}(x)|$ is bounded as $|x| \to \infty$.

‡This definition differs slightly from that of Whittaker and Watson [(1943), p. 126], who expand
$\tau(e^{h\tau} - 1)/(e^{\tau} - 1)$. If $B_r^*(h)$ is this second type of polynomial, $B_1(h) = B_1^*(h) - \tfrac12$, $B_{2r}(h) =
B_{2r}^*(h) + (-1)^{r+1}B_r$, where $B_r$ is the rth Bernoulli number, and $B_{2r+1}(h) = B_{2r+1}^*(h)$.

One resulting form for $\phi(t)$ (which we shall not use here) is

(13)  $$\phi(t) = e^{\Phi(t)} = e^{Q - g(0)}(1 - 2it)^{-\frac12 f}\sum_{v=0}^{m}a_v(1 - 2it)^{-v} + R^*_{m+1},$$

where $\sum_{v=0}^{m}a_vz^{-v}$ is the sum of the first $m + 1$ terms in the series expansion
of $\exp(\sum_{r=1}^{m}\omega_rz^{-r})$, and $R^*_{m+1}$ is a remainder term. Alternatively,

(14)  $$\Phi(t) = -\tfrac12 f\log(1 - 2it) + \sum_{r=1}^{m}\omega_r[(1 - 2it)^{-r} - 1] + R'_{m+1},$$

where

(15)  $$R'_{m+1} = \sum_{k}O(x_k^{-(m+1)}) + \sum_{j}O(y_j^{-(m+1)}).$$

In (14) we have expanded $g(0)$ in the same way we expanded $g(t)$ and have
collected similar terms.
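Then

(16)  $$\begin{aligned}
\phi(t) &= e^{\Phi(t)} \\
&= (1 - 2it)^{-\frac12 f}\exp\Bigl[\sum_{r=1}^{m}\omega_r(1 - 2it)^{-r} - \sum_{r=1}^{m}\omega_r + R'_{m+1}\Bigr] \\
&= (1 - 2it)^{-\frac12 f}\prod_{r=1}^{m}\Bigl\{1 + \omega_r(1 - 2it)^{-r} + \frac{1}{2!}\omega_r^2(1 - 2it)^{-2r} + \cdots\Bigr\} \\
&\qquad\times\prod_{r=1}^{m}\Bigl(1 - \omega_r + \frac{1}{2!}\omega_r^2 - \cdots\Bigr)(1 + R''_{m+1}) \\
&= (1 - 2it)^{-\frac12 f}\bigl[1 + T_1(t) + T_2(t) + \cdots + T_m(t) + R'''_{m+1}\bigr],
\end{aligned}$$

where $T_r(t)$ is the term in the expansion with terms $\omega_1^{s_1}\cdots\omega_r^{s_r}$, $\sum_i is_i = r$; for
example,

(17)  $$T_1(t) = \omega_1[(1 - 2it)^{-1} - 1],$$

(18)  $$T_2(t) = \omega_2[(1 - 2it)^{-2} - 1] + \tfrac12\omega_1^2[(1 - 2it)^{-2} - 2(1 - 2it)^{-1} + 1].$$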

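In most applications, we will have $x_k = c_k\theta$ and $y_j = d_j\theta$, where $c_k$ and $d_j$
will be constant and $\theta$ will vary (i.e., will grow with the sample size). In this
case if $\rho$ is chosen so $(1 - \rho)x_k$ and $(1 - \rho)y_j$ have limits, then $R'''_{m+1}$ is
$O(\theta^{-(m+1)})$. We collect in (16) all terms $\omega_1^{s_1}\cdots\omega_r^{s_r}$, $\sum_i is_i = r$, because these
terms are $O(\theta^{-r})$.

It will be observed that $T_r(t)$ is a polynomial of degree $r$ in $(1 - 2it)^{-1}$ and
each term of $(1 - 2it)^{-\frac12 f}T_r(t)$ is a constant times $(1 - 2it)^{-\frac12 v}$ for an integral
$v$. We know that $(1 - 2it)^{-\frac12 v}$ is the characteristic function of the $\chi^2$-density
with $v$ degrees of freedom; that is,

(19)  $$g_v(z) = \frac{1}{2^{\frac12 v}\Gamma(\tfrac12 v)}\,z^{\frac12 v-1}e^{-\frac12 z}
      = \int_{-\infty}^{\infty}\frac{1}{2\pi}(1 - 2it)^{-\frac12 v}e^{-itz}\,dt.$$

Let

(20)  $$\begin{aligned}
S_r(z) &= \int_{-\infty}^{\infty}\frac{1}{2\pi}(1 - 2it)^{-\frac12 f}\,T_r(t)\,e^{-itz}\,dt, \\
R^{iv}_{m+1} &= \int_{-\infty}^{\infty}\frac{1}{2\pi}(1 - 2it)^{-\frac12 f}\,R'''_{m+1}\,e^{-itz}\,dt.
\end{aligned}$$

Then the density of $\rho M$ is

(21)  $$\begin{aligned}
\int_{-\infty}^{\infty}\frac{1}{2\pi}\phi(t)e^{-itz}\,dt
&= \sum_{r=0}^{m}S_r(z) + R^{iv}_{m+1} \\
&= g_f(z) + \omega_1[g_{f+2}(z) - g_f(z)] \\
&\quad + \Bigl\{\omega_2[g_{f+4}(z) - g_f(z)] + \frac{\omega_1^2}{2}[g_{f+4}(z) - 2g_{f+2}(z) + g_f(z)]\Bigr\} \\
&\quad + \cdots + S_m(z) + R^{iv}_{m+1}.
\end{aligned}$$

Let

(22)  $$U_r(z_0) = \int_0^{z_0}S_r(z)\,dz, \qquad R^{v}_{m+1} = \int_0^{z_0}R^{iv}_{m+1}\,dz.$$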

The cdf of $M$ is written in terms of the cdf of $\rho M$, which is the integral of
the density, namely,

(23)  $$\begin{aligned}
\Pr\{M \le M_0\} &= \Pr\{\rho M \le \rho M_0\} \\
&= \sum_{r=0}^{m}U_r(\rho M_0) + R^{v}_{m+1} \\
&= \Pr\{\chi^2_f \le \rho M_0\} + \omega_1\bigl(\Pr\{\chi^2_{f+2} \le \rho M_0\} - \Pr\{\chi^2_f \le \rho M_0\}\bigr) \\
&\quad + \Bigl[\omega_2\bigl(\Pr\{\chi^2_{f+4} \le \rho M_0\} - \Pr\{\chi^2_f \le \rho M_0\}\bigr) \\
&\qquad + \frac{\omega_1^2}{2}\bigl(\Pr\{\chi^2_{f+4} \le \rho M_0\} - 2\Pr\{\chi^2_{f+2} \le \rho M_0\} + \Pr\{\chi^2_f \le \rho M_0\}\bigr)\Bigr] \\
&\quad + \cdots + U_m(\rho M_0) + R^{v}_{m+1}.
\end{aligned}$$

The remainder $R^{v}_{m+1}$ is $O(\theta^{-(m+1)})$; this last statement can be verified by
following the remainder terms along. (In fact, to make the proof rigorous one
needs to verify that each remainder is of the proper order in a uniform
sense.)

In many cases it is desirable to choose $\rho$ so that $\omega_1 = 0$. In such a case
using only the first term of (23) gives an error of order $\theta^{-2}$.

Further details of the expansion can be found in Box's paper (1949).


Theorem 8.5.1. Suppose that $\mathcal{E}W^h$ is given by (1) for all purely imaginary
$h$, with (2) holding. Then the cdf of $-2\rho\log W$ is given by (23). The error,
$R^{v}_{m+1}$, is $O(\theta^{-(m+1)})$ if $x_k \ge c_k\theta$, $y_j \ge d_j\theta$ ($c_k > 0$, $d_j > 0$), and if $(1 - \rho)x_k$,
$(1 - \rho)y_j$ have limits, where $\rho$ may depend on $\theta$.


Box also considers approximating the distribution of $-2\rho\log W$ by an
F-distribution. He finds that the error in this approximation can be made to
be of order $\theta^{-3}$.


8.5.2. Asymptotic Distribution of the Likelihood Ratio Criterion 


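We now apply Theorem 8.5.1 to the distribution of $-2\log\lambda$, the likelihood
ratio criterion developed in Section 8.3. We let $W = \lambda$. The hth moment of $\lambda$
is

(24)  $$\mathcal{E}\lambda^h = K\,\frac{\prod_{k=1}^{p}\Gamma[\tfrac12(N - q + 1 - k + Nh)]}
                                     {\prod_{j=1}^{p}\Gamma[\tfrac12(N - q_2 + 1 - j + Nh)]},$$

and this holds for all $h$ for which the gamma functions exist, including purely
imaginary $h$. We let $a = b = p$,

(25)  $$\begin{aligned}
x_k &= \tfrac12 N, & \xi_k &= \tfrac12(-q + 1 - k), & \beta_k &= \tfrac12(1 - \rho)N, \\
y_j &= \tfrac12 N, & \eta_j &= \tfrac12(-q_2 + 1 - j), & \varepsilon_j &= \tfrac12(1 - \rho)N.
\end{aligned}$$

We observe that

(26)  $$\begin{aligned}
2\omega_1 &= \sum_{k=1}^{p}\frac{\bigl\{\tfrac12[(1 - \rho)N - q + 1 - k]\bigr\}^2 - \tfrac12[(1 - \rho)N - q + 1 - k]}{\tfrac12\rho N} \\
&\quad - \sum_{j=1}^{p}\frac{\bigl\{\tfrac12[(1 - \rho)N - q_2 + 1 - j]\bigr\}^2 - \tfrac12[(1 - \rho)N - q_2 + 1 - j]}{\tfrac12\rho N} \\
&= \frac{pq_1}{2\rho N}\bigl[-2(1 - \rho)N + 2q_2 - 2 + (p + 1) + q_1 + 2\bigr].
\end{aligned}$$

To make this zero, we require that

(27)  $$\rho = \frac{N - q_2 - \tfrac12(p + q_1 + 1)}{N}.$$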


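Then

(28)  $$\begin{aligned}
\Pr\Bigl\{-2\frac{k}{N}\log\lambda \le z\Bigr\} &= \Pr\{-k\log U_{p,q_1,n} \le z\} \\
&= \Pr\{\chi^2_{pq_1} \le z\}
 + \frac{\gamma_2}{k^2}\bigl(\Pr\{\chi^2_{pq_1+4} \le z\} - \Pr\{\chi^2_{pq_1} \le z\}\bigr) \\
&\quad + \frac{1}{k^4}\Bigl[\gamma_4\bigl(\Pr\{\chi^2_{pq_1+8} \le z\} - \Pr\{\chi^2_{pq_1} \le z\}\bigr) \\
&\qquad - \gamma_2^2\bigl(\Pr\{\chi^2_{pq_1+4} \le z\} - \Pr\{\chi^2_{pq_1} \le z\}\bigr)\Bigr] + R^{v}_{5},
\end{aligned}$$

where

(29)  $$k = \rho N = N - q_2 - \tfrac12(p + q_1 + 1) = n - \tfrac12(p - q_1 + 1),$$

(30)  $$\gamma_2 = \frac{pq_1(p^2 + q_1^2 - 5)}{48},$$

(31)  $$\gamma_4 = \frac{\gamma_2^2}{2} + \frac{pq_1}{1920}\bigl[3p^4 + 3q_1^4 + 10p^2q_1^2 - 50(p^2 + q_1^2) + 159\bigr].$$

Since $\lambda = U^{\frac12 N}_{p,q_1,n}$, where $n = N - q$, (28) gives $\Pr\{-k\log U_{p,q_1,n} \le z\}$.

Theorem 8.5.2. The cdf of $-k\log U_{p,q_1,n}$ is given by (28) with $k = n -
\tfrac12(p - q_1 + 1)$, and $\gamma_2$ and $\gamma_4$ given by (30) and (31), respectively. The
remainder term is $O(N^{-6})$.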


The coefficient $k = n - \tfrac12(p - q_1 + 1)$ is known as the Bartlett correction.
If the first term of (28) is used, the error is of the order $N^{-2}$; if the second,
$N^{-4}$; and if the third,† $N^{-6}$. The second term is always negative and is
numerically maximum for $z = \sqrt{(pq_1 + 2)(pq_1)}$ ($= pq_1 + 1$, approximately).
For $p \ge 3$, $q_1 \ge 3$, we have $\gamma_2/k^2 \le [(p^2 + q_1^2)/k]^2/96$, and the contribution
of the second term lies between $-0.005[(p^2 + q_1^2)/k]^2$ and 0. For $p \ge 3$,
$q_1 \ge 3$, we have $\gamma_4 \le \gamma_2^2$, and the contribution of the third term is numerically
less than $(\gamma_2/k^2)^2$. A rough rule that may be followed is that use of the first
term is accurate to three decimal places if $p^2 + q_1^2 \le k/3$.

As an example of the calculation, consider the case of $p = 3$, $q_1 = 6$,
$N - q_2 = 24$, and $z = 26.0$ (the 10% significance point of $\chi^2_{18}$). In this case
$\gamma_2/k^2 = 0.048$ and the second term is $-0.007$; $\gamma_4/k^4 = 0.0015$ and the third
term is $-0.0001$. Thus the probability of $-19\log U_{3,6,18} \le 26.0$ is 0.893 to
three decimal places.
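The three-term expansion (28) can be evaluated with the standard library alone. The following is a sketch under stated assumptions: the function names are invented for the example, and the $\chi^2$ cdf is computed from the usual power series for the regularized incomplete gamma function rather than taken from a statistics package.

```python
from math import exp, lgamma, log

def chi2_cdf(z, df):
    """Pr{chi-square with df degrees of freedom <= z}, by power series."""
    if z <= 0:
        return 0.0
    a, x = df / 2, z / 2
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= x / (a + k)
        total += term
    return total * exp(a * log(x) - x - lgamma(a))

def bartlett_k(p, q1, n):
    """k = n - (p - q1 + 1)/2, as in (29), with n = N - q."""
    return n - (p - q1 + 1) / 2

def gamma2(p, q1):
    """Coefficient (30)."""
    return p * q1 * (p * p + q1 * q1 - 5) / 48

def gamma4(p, q1):
    """Coefficient (31)."""
    bracket = 3 * p**4 + 3 * q1**4 + 10 * p * p * q1 * q1 - 50 * (p * p + q1 * q1) + 159
    return gamma2(p, q1) ** 2 / 2 + p * q1 / 1920 * bracket

def cdf_minus_k_log_u(z, p, q1, n, terms=3):
    """First `terms` terms (1, 2, or 3) of (28) for Pr{-k log U_{p,q1,n} <= z}."""
    k = bartlett_k(p, q1, n)
    f = p * q1
    base = chi2_cdf(z, f)
    value = base
    if terms >= 2:
        value += gamma2(p, q1) / k**2 * (chi2_cdf(z, f + 4) - base)
    if terms >= 3:
        value += (gamma4(p, q1) * (chi2_cdf(z, f + 8) - base)
                  - gamma2(p, q1) ** 2 * (chi2_cdf(z, f + 4) - base)) / k**4
    return value
```

For $p = 3$, $q_1 = 6$, $n = 18$ the functions give $k = 19$ and $\gamma_2 = 15$ from (29) and (30) as printed; `cdf_minus_k_log_u(26.0, 3, 6, 18)` can then be compared with the probability quoted in the example, and calling it with `terms=1` or `terms=2` shows the size of each correction.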

Since

(32)  $$-[n - \tfrac12(p - m + 1)]\log u_{p,m,n}(\alpha) = C_{p,m,n-p+1}(\alpha)\,\chi^2_{pm}(\alpha),$$

the proportional error in approximating the left-hand side by $\chi^2_{pm}(\alpha)$ is
$C_{p,m,n-p+1} - 1$. The proportional error increases slowly with $p$ and $m$.


8.5.3. A Normal Approximation 


Mudholkar and Trivedi (1980), (1981) developed a normal approximation to
the distribution of $-\log U_{p,m,n}$ which is asymptotic as $p$ and/or $m \to \infty$. It is
related to the Wilson-Hilferty normal approximation for the $\chi^2$-distribution.


†Box has shown that the term of order $N^{-5}$ is 0 and gives the coefficients to be used in the term
of order $N^{-6}$.

First, we give the background of the approximation. Suppose $\{Y_k\}$ is a
sequence of nonnegative random variables such that $(Y_k - \mu_k)/\sigma_k \xrightarrow{d} N(0, 1)$
as $k \to \infty$, where $\mathcal{E}Y_k = \mu_k$ and $\mathcal{V}(Y_k) = \sigma_k^2$. Suppose also that $\mu_k \to \infty$ and
$\sigma_k^2/\mu_k$ is bounded as $k \to \infty$. Let $Z_k = (Y_k/\mu_k)^h$. Then

(33)  $$\frac{\mu_k(Z_k - 1)}{h\sigma_k} \xrightarrow{d} N(0, 1)$$

by Theorem 4.2.3. The approach to normality may be accelerated by choosing
$h$ to make the distribution of $Z_k$ nearly symmetric as measured by its third
cumulant. The normal distribution is to be used as an approximation and is
justified by its accuracy in practice. However, it will be convenient to develop
the ideas in terms of limits, although rigor is not necessary.

By a Taylor expansion we express the hth moment of $Y_k/\mu_k$ as

(34)  $$\mathcal{E}Z_k = 1 + \frac{h(h-1)\sigma_k^2}{2\mu_k^2}
      + \frac{h(h-1)(h-2)}{24}\cdot\frac{4\phi_k + 3(h-3)(\sigma_k^2/\mu_k)^2}{\mu_k^2} + O(\mu_k^{-3}),$$

where $\phi_k = \mathcal{E}(Y_k - \mu_k)^3/\mu_k$, assumed bounded. The rth moment of $Z_k$ is
expressed by replacement of $h$ by $rh$ in (34). The central moments of $Z_k$ are

(35)  $$\mathcal{E}(Z_k - \mathcal{E}Z_k)^2 = \frac{h^2\sigma_k^2}{\mu_k^2}
      + \frac{h^2(h-1)}{2}\cdot\frac{2\phi_k + (3h-5)(\sigma_k^2/\mu_k)^2}{\mu_k^2} + O(\mu_k^{-3}),$$

(36)  $$\mathcal{E}(Z_k - \mathcal{E}Z_k)^3 = h^3\,\frac{\phi_k + 3(h-1)(\sigma_k^2/\mu_k)^2}{\mu_k^2} + O(\mu_k^{-3}).$$

To make the third moment approximately 0 we take $h$ to be

(37)  $$h_0 = 1 - \frac{\mu_k\,\mathcal{E}(Y_k - \mu_k)^3}{3\sigma_k^4}.$$

Then $Z_k = (Y_k/\mu_k)^{h_0}$ is treated as normally distributed with mean and
variance given by (34) and (35), respectively, with $h = h_0$.


Now we consider $-\log U_{p,m,n} = -\sum_{i=1}^{p}\log V_i$, where $V_1, \ldots, V_p$ are independent
and $V_i$ has the density $\beta(x; (n+1-i)/2, m/2)$, $i = 1, \ldots, p$. As
$n \to \infty$ and $m \to \infty$, $-\log V_i$ tends to normality. If $V$ has the density
$\beta(x; a/2, b/2)$, the moment generating function of $-\log V$ is

(38)  $$\mathcal{E}e^{-t\log V} = \frac{\Gamma[(a+b)/2]\,\Gamma(a/2 - t)}{\Gamma(a/2)\,\Gamma[(a+b)/2 - t]}.$$

Its logarithm is the cumulant generating function. Differentiation of the last
yields as the rth cumulant of $-\log V$

(39)  $$\kappa_r = (-1)^r\bigl[\psi^{(r-1)}(a/2) - \psi^{(r-1)}((a+b)/2)\bigr], \qquad r = 1, 2, \ldots,$$

where $\psi(w) = d\log\Gamma(w)/dw$. [See Abramowitz and Stegun (1972), p. 258, for
example.] From $\Gamma(w + 1) = w\Gamma(w)$ we obtain the recursion relation $\psi(w + 1)
= \psi(w) + 1/w$. This yields for $s = 0$ and $l$ an integer

(40)  $$\psi^{(s)}(w + l) - \psi^{(s)}(w) = (-1)^s s!\sum_{j=0}^{l-1}\frac{1}{(w + j)^{s+1}}.$$

The validity of (40) for $s = 1, 2, \ldots$ is verified by differentiation. [The expression
for $\psi'(Z)$ in the first line of page 223 of Mudholkar and Trivedi (1981) is
incorrect.] Thus for $b = 2l$

(41)  $$\kappa_r = (r-1)!\sum_{j=0}^{l-1}\frac{1}{(a/2 + j)^r}.$$

From these results we obtain as the rth cumulant of $-\log U_{p,2l,n}$

(42)  $$\kappa_r(-\log U_{p,2l,n}) = 2^r(r-1)!\sum_{i=1}^{p}\sum_{j=0}^{l-1}\frac{1}{(n - i + 1 + 2j)^r}.$$

As $l \to \infty$ the series diverges for $r = 1$ and converges for $r = 2, 3$, and hence
$\kappa_r/\kappa_1 \to 0$, $r = 2, 3$. The same is true as $p \to \infty$ (if $n/p$ approaches a positive
constant).

Given $n$, $p$, and $l$, the first three cumulants are calculated from (42). Then
$h_0$ is determined from (37), and $(-\log U_{p,2l,n})^{h_0}$ is treated as approximately
normally distributed with mean and variance calculated from (34) and (35)
for $h = h_0$.
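The first steps of this recipe, the cumulants from (42) and the power $h_0$ from (37), can be sketched directly. The function names are assumptions for the example; (37) is written in terms of cumulants using $\mu = \kappa_1$, $\sigma^2 = \kappa_2$, and third central moment $\kappa_3$.

```python
from math import factorial

def cumulant_neg_log_u(r, p, l, n):
    """r-th cumulant of -log U_{p,2l,n} from (42), for m = 2l even."""
    total = 0.0
    for i in range(1, p + 1):
        for j in range(l):
            total += 1.0 / (n - i + 1 + 2 * j) ** r
    return 2 ** r * factorial(r - 1) * total

def symmetrizing_power(k1, k2, k3):
    """h_0 of (37): 1 - mu * (third central moment) / (3 sigma^4)."""
    return 1 - k1 * k3 / (3 * k2 * k2)

def h0_for_u(p, l, n):
    """h_0 for -log U_{p,2l,n}, using the first three cumulants."""
    k1, k2, k3 = (cumulant_neg_log_u(r, p, l, n) for r in (1, 2, 3))
    return symmetrizing_power(k1, k2, k3)
```

As a check on the link with the Wilson-Hilferty approximation, a $\chi^2$-variable with $\nu$ degrees of freedom has cumulants $\nu$, $2\nu$, $8\nu$, for which `symmetrizing_power` returns $1/3$, the cube-root transformation.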

Mudholkar and Trivedi (1980) calculated the error of approximation for
significance levels of 0.01 and 0.05 for $n$ from 4 to 66, $p = 3, 7$, and
$q = 2, 6, 10$. The maximum error is less than 0.0007; in most cases the error is
considerably less. The error for the $\chi^2$-approximation is much larger, especially
for small values of $n$.

In case of $m$ odd the rth cumulant can be approximated by

(43)  $$2^r(r-1)!\sum_{i=1}^{p}\Bigl[\sum_{j=0}^{\frac12(m-3)}\frac{1}{(n - i + 1 + 2j)^r}
      + \frac12\cdot\frac{1}{(n - i + m)^r}\Bigr].$$

Davis (1933, 1935) gave tables of $\psi(w)$ and its derivatives.


8.5.4. An F-Approximation


Rao (1951) has used the expansion of Section 8.5.2 to develop an expansion
of the distribution of another function of $U_{p,m,n}$ in terms of beta distributions.
The constants can be adjusted so that the term after the leading one is
of order $m^{-4}$. A good approximation is to consider

(44)  $$\frac{1 - U^{1/s}}{U^{1/s}}\cdot\frac{ks - r}{pm}$$

as F with $pm$ and $ks - r$ degrees of freedom, where

(45)  $$s = \sqrt{\frac{p^2m^2 - 4}{p^2 + m^2 - 5}}, \qquad r = \frac{pm}{2} - 1,$$

and $k$ is $n - \tfrac12(p - m + 1)$. For $p = 1$ or 2 or $m = 1$ or 2 the F-distribution is
exactly as given in Section 8.4. If $ks - r$ is not an integer, interpolation
between two integer values can be used. For smaller values of $m$ this
approximation is more accurate than the $\chi^2$-approximation.
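Rao's approximation (44)-(45) is straightforward to code. This is a minimal sketch with an assumed function name; the convention $s = 1$ when $pm = 2$, where the ratio in (45) is $0/0$, is an assumption of the sketch rather than something stated in the text.

```python
from math import sqrt

def rao_f(u, p, m, n):
    """Rao's F approximation (44)-(45): returns (F statistic, df1, df2)."""
    denom = p * p + m * m - 5
    # For (p, m) = (1, 2) or (2, 1) the ratio in (45) is 0/0; s = 1 is assumed here.
    s = 1.0 if denom == 0 else sqrt((p * p * m * m - 4) / denom)
    r = p * m / 2 - 1
    k = n - (p - m + 1) / 2
    root = u ** (1 / s)
    df2 = k * s - r                      # may be non-integer
    return (1 - root) / root * df2 / (p * m), p * m, df2
```

For $p = 1$ the function reduces to the statistic of Theorem 8.4.5 with $m$ and $n$ degrees of freedom, and for $p = 2$ to that of Theorem 8.4.6 with $2m$ and $2(n-1)$, which gives a simple consistency check.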


8.6. OTHER CRITERIA FOR TESTING THE LINEAR HYPOTHESIS 


8.6.1. Functions of Roots 


Thus far the only test of the linear hypothesis we have considered is the 
likelihood ratio test. In this section we consider other test procedures. 

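Let $\hat\Sigma_\Omega$, $\hat\beta_{1\Omega}$, and $\hat\beta_{2\omega}$ be the estimates of the parameters in $N(\beta z, \Sigma)$,
based on a sample of $N$ observations. These are a sufficient set of statistics,
and we shall base test procedures on them. As was shown in Section 8.3, if
the hypothesis is $\beta_1 = \beta_1^*$, one can reformulate the hypothesis as $\beta_1 = 0$ (by
replacing $x_\alpha$ by $x_\alpha - \beta_1^*z_\alpha^{(1)}$). Moreover,

(1)  $$\begin{aligned}
\beta z_\alpha &= \beta_1z_\alpha^{(1)} + \beta_2z_\alpha^{(2)} \\
&= \beta_1\bigl(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\bigr) + \bigl(\beta_2 + \beta_1A_{12}A_{22}^{-1}\bigr)z_\alpha^{(2)} \\
&= \beta_1z_\alpha^{*(1)} + \beta_2^*z_\alpha^{(2)},
\end{aligned}$$

where $\sum_\alpha z_\alpha^{*(1)}z_\alpha^{(2)\prime} = 0$ and $\sum_\alpha z_\alpha^{*(1)}z_\alpha^{*(1)\prime} = A_{11\cdot 2}$. Then $\hat\beta_1 = \hat\beta_{1\Omega}$ and $\hat\beta_2^* =
\hat\beta_{2\omega}$.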

We shall use the principle of invariance to reduce the set of tests to be
considered. First, if we make the transformation $X_\alpha^* = X_\alpha + \Gamma z_\alpha^{(2)}$, we leave
the null hypothesis invariant, since $\mathcal{E}X_\alpha^* = \beta_1z_\alpha^{*(1)} + (\beta_2^* + \Gamma)z_\alpha^{(2)}$ and $\beta_2^* + \Gamma$
is unspecified. The only invariants of the sufficient statistics are $\hat\Sigma$ and $\hat\beta_1$
(since for each $\hat\beta_2^*$, there is a $\Gamma$ that transforms it to 0, that is, $-\hat\beta_2^*$).

Second, the null hypothesis is invariant under the transformation $z_\alpha^{**(1)} =
Cz_\alpha^{*(1)}$ ($C$ nonsingular); the transformation carries $\beta_1$ to $\beta_1C^{-1}$. Under this
transformation $\hat\Sigma$ and $\hat\beta_1A_{11\cdot 2}\hat\beta_1'$ are invariant; we consider $A_{11\cdot 2}$ as information
relevant to inference. However, these are the only invariants. For
consider a function of $\hat\beta_1$ and $A_{11\cdot 2}$, say $f(\hat\beta_1, A_{11\cdot 2})$. Then there is a $C^*$ that
carries this into $f(\hat\beta_1C^{*-1}, I)$ and a further orthogonal transformation
carries this into $f(T, I)$, where $t_{iv} = 0$, $i < v$, $t_{ii} \ge 0$. (If each row of $T$ is
considered a vector in $q_1$-space, the rotation of coordinate axes can be done
so the first vector is along the first coordinate axis, the second vector is in the
plane determined by the first two coordinate axes, and so forth.) But $T$ is a
function of $TT' = \hat\beta_1A_{11\cdot 2}\hat\beta_1'$; that is, the elements of $T$ are uniquely determined
by this equation and the preceding restrictions. Thus our tests will
depend on $\hat\Sigma$ and $\hat\beta_1A_{11\cdot 2}\hat\beta_1'$. Let $N\hat\Sigma = G$ and $\hat\beta_1A_{11\cdot 2}\hat\beta_1' = H$.

Third, the null hypothesis is invariant when $x_\alpha$ is replaced by $Kx_\alpha$, for $\Sigma$
and $\beta_2^*$ are unspecified. This transforms $G$ to $KGK'$ and $H$ to $KHK'$. The
only invariants of $G$ and $H$ under such transformations are the roots of

(2)  $$|H - lG| = 0.$$

It is clear the roots are invariant, for

(3)  $$0 = |KHK' - lKGK'| = |K(H - lG)K'| = |K|\cdot|H - lG|\cdot|K'|.$$

On the other hand, these are the only invariants, for given $G$ and $H$ there is
a $K$ such that $KGK' = I$ and

(4)  $$KHK' = L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & l_p \end{pmatrix},$$

where $l_1 \ge \cdots \ge l_p$ are the roots of (2). (See Theorem A.2.2 of the Appendix.)


Theorem 8.6.1. Let $x_\alpha$ be an observation from $N(\beta_1z_\alpha^{*(1)} + \beta_2^*z_\alpha^{(2)}, \Sigma)$,
where $\sum_\alpha z_\alpha^{*(1)}z_\alpha^{(2)\prime} = 0$ and $\sum_\alpha z_\alpha^{*(1)}z_\alpha^{*(1)\prime} = A_{11\cdot 2}$. The only functions of the
sufficient statistics and $A_{11\cdot 2}$ invariant under the transformations $x_\alpha^* = x_\alpha +
\Gamma z_\alpha^{(2)}$, $z_\alpha^{**(1)} = Cz_\alpha^{*(1)}$, and $x_\alpha^* = Kx_\alpha$ are the roots of (2), where $G = N\hat\Sigma$ and
$H = \hat\beta_1A_{11\cdot 2}\hat\beta_1'$.

The likelihood ratio criterion is a function of

(5)  $$U = \frac{|G|}{|G + H|} = \frac{|KGK'|}{|KGK' + KHK'|} = \frac{|I|}{|I + L|} = \prod_{i=1}^{p}(1 + l_i)^{-1},$$

which is clearly invariant under the transformations.

Intuitively it would appear that good tests should reject the null hypothesis
when the roots in some sense are large, for if $\beta_1$ is very different from 0, then
$\hat\beta_1$ will tend to be large and so will $H$. Some other criteria that have been
suggested are (a) $\sum l_i$, (b) $\sum l_i/(1 + l_i)$, (c) $\max l_i$, and (d) $\min l_i$. In each case
we reject the null hypothesis if the criterion exceeds some specified number.
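Once the roots of $|H - lG| = 0$ are available, each criterion is a one-line function of them. The sketch below takes the roots as given (computing them from $G$ and $H$ would need an eigenvalue routine, which is outside this example); the function name and dictionary keys are assumptions for illustration.

```python
def criteria_from_roots(roots):
    """Invariant test criteria from the roots l_1, ..., l_p of |H - lG| = 0."""
    u = 1.0
    for l in roots:
        u /= 1 + l
    return {
        "U": u,                                        # (5): prod (1 + l_i)^{-1}
        "sum_roots": sum(roots),                       # criterion (a)
        "sum_ratio": sum(l / (1 + l) for l in roots),  # criterion (b)
        "max_root": max(roots),                        # criterion (c)
        "min_root": min(roots),                        # criterion (d)
    }
```

Note the direction of rejection: criteria (a) to (d) reject for large values, whereas $U$ is small when the roots are large, so the likelihood ratio test rejects for small $U$.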


8.6.2. The Lawley-Hotelling Trace Criterion 


Let $K$ be the matrix such that $KGK' = I$ [$G = K^{-1}(K')^{-1}$, or $G^{-1} = K'K$]
and so (4) holds. Then the sum of the roots can be written

(6)  $$\sum_{i=1}^{p}l_i = \operatorname{tr}L = \operatorname{tr}KHK' = \operatorname{tr}HK'K = \operatorname{tr}HG^{-1}.$$

This criterion was suggested by Lawley (1938), Bartlett (1939), and Hotelling
(1947), (1951). The test procedure is to reject the hypothesis if (6) is greater
than a constant depending on $p$, $m$, and $n$.

The general distribution† of $\operatorname{tr}HG^{-1}$ cannot be characterized as easily as
that of $U_{p,m,n}$. In the case of $p = 2$, Hotelling (1951) obtained an explicit
expression for the distribution of $\operatorname{tr}HG^{-1} = l_1 + l_2$. A slightly different form
of this distribution is obtained from the density of the two roots $l_1$ and $l_2$ in
Chapter 13. It is

(7)  $$\Pr\{\operatorname{tr}HG^{-1} \le w\} = I_{w/(2+w)}(m - 1, n - 1)
     - \frac{\sqrt{\pi}\,\Gamma[\tfrac12(m+n-1)]}{\Gamma(\tfrac12 m)\,\Gamma(\tfrac12 n)}\,(1 + w)^{-\frac12(n-1)}\,
       I_{w^2/(2+w)^2}\bigl[\tfrac12(m-1), \tfrac12(n-1)\bigr],$$

where $I_x(a, b)$ is the incomplete beta function, that is, the integral of $\beta(y; a, b)$
from 0 to $x$.

Constantine (1966) expressed the density of $\operatorname{tr}HG^{-1}$ as an infinite series
in generalized Laguerre polynomials and as an infinite series in zonal
polynomials; these series, however, converge only for $\operatorname{tr}HG^{-1} < 1$. Davis
(1968) showed that the analytic continuation of these series satisfies a system
of linear homogeneous differential equations of order $p$. Davis (1970a,
1970b) used a solution to compute tables as given in Appendix B.

Under the null hypothesis, $G$ is distributed as $\sum_{\alpha=1}^{n}Z_\alpha Z_\alpha'$ ($n = N - q$) and
$H$ is distributed as $\sum_{v=1}^{m}Y_vY_v'$, where the $Z_\alpha$ and $Y_v$ are independent, each
with distribution $N(0, \Sigma)$. Since the roots are invariant under the previously
specified linear transformation, we can choose $K$ so that $K\Sigma K' = I$ and let
$G^* = KGK'$ [$= \sum(KZ_\alpha)(KZ_\alpha)'$] and $H^* = KHK'$. This is equivalent to assuming
at the outset that $\Sigma = I$.

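Now

(8)    $\operatorname{plim}_{N\to\infty} \frac{1}{N}\,G = \operatorname{plim}_{n\to\infty} \frac{1}{n}\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' = I.$

This result follows on applying the (weak) law of large numbers to each element
of (1/n)G,

(9)    $\operatorname{plim}_{n\to\infty} \frac{1}{n}\sum_{\alpha=1}^{n} Z_{i\alpha}Z_{j\alpha} = \mathcal{E}Z_{i1}Z_{j1} = \delta_{ij}.$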


Theorem 8.6.2. Let f(H) be a function whose discontinuities form a set of
probability zero when H is distributed as $\sum_{\nu=1}^{m} Y_\nu Y_\nu'$ with the $Y_\nu$ independent, each
with distribution N(0, I). Then the limiting distribution of $f(NHG^{-1})$ is the
distribution of f(H).


†Lawley (1938) purported to derive the exact distribution, but the result is in error.
Proof. This is a straightforward application of a general theorem [for
example, Theorem 2 of Chernoff (1956)] to the effect that if the cdf of $X_n$
converges to that of X (at every continuity point of the latter) and if g(x) is
a function whose discontinuities form a set of probability 0 according to
the distribution of X, then the cdf of $g(X_n)$ converges to that of g(X). In our
case $X_n$ consists of the components of H and G, and X consists of the
components of H and I. ∎


Corollary 8.6.1. The limiting distribution of $N \operatorname{tr} HG^{-1}$ or $n \operatorname{tr} HG^{-1}$ is the
$\chi^2$-distribution with $pq_1$ degrees of freedom.


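This follows from Theorem 8.6.2, because

(10)    $\operatorname{tr} H = \sum_{i=1}^{p} h_{ii} = \sum_{i=1}^{p}\sum_{\nu=1}^{q_1} Y_{i\nu}^2.$

Ito (1956), (1960) developed asymptotic formulas, and Fujikoshi (1973)
extended them. Let $w_{p,m,n}(\alpha)$ be the $\alpha$ significance point of $\operatorname{tr} HG^{-1}$; that is,

(11)    $\Pr\{\operatorname{tr} HG^{-1} > w_{p,m,n}(\alpha)\} = \alpha,$

and let $\chi^2_k(\alpha)$ be the $\alpha$ significance point of the $\chi^2$-distribution with k
degrees of freedom. Then

(12)    $n w_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \frac{1}{2n}\left[\frac{p+m+1}{pm+2}\,\chi^4_{pm}(\alpha) + (p-m+1)\,\chi^2_{pm}(\alpha)\right] + O(n^{-2}).$

Ito also gives the term of order $n^{-2}$. See also Muirhead (1970). Davis
(1970a), (1970b) evaluated the accuracy of the approximation (12). Ito also
found

(13)    $\Pr\{n \operatorname{tr} HG^{-1} \le z\} = G_{pm}(z) - \frac{1}{2n}\left[\frac{p+m+1}{pm+2}\,z^2 + (p-m+1)\,z\right] g_{pm}(z) + O(n^{-2}),$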


where $G_k(z) = \Pr\{\chi^2_k \le z\}$ and $g_k(z) = (d/dz)G_k(z)$. Pillai (1956) suggested
another approximation to $n w_{p,m,n}(\alpha)$, and Pillai and Samson (1959) gave
moments of $\operatorname{tr} HG^{-1}$. Pillai and Young (1971) and Krishnaiah and Chang
(1972) evaluated the Laplace transform of $\operatorname{tr} HG^{-1}$ and showed how to invert
the transform. Khatri and Pillai (1966) suggest an approximate distribution
based on moments. Pillai and Young (1971) suggest approximate distributions
based on the first three moments.

Tables of the significance points are given by Grubbs (1954) for p = 2 and
by Davis (1970a) for p = 3 and 4, Davis (1970b) for p = 5, and Davis (1980)
for p = 6(1)10; approximate significance points have been given by Pillai
(1960). Davis's tables are reproduced in Table B.2.


8.6.3. The Bartlett-Nanda-Pillai Trace Criterion


Another criterion, proposed by Bartlett (1939), Nanda (1950), and Pillai
(1955), is

(14)    $V = \sum_{i=1}^{p}\frac{l_i}{1+l_i} = \operatorname{tr} KHK'(KGK' + KHK')^{-1}$
           $= \operatorname{tr} HK'[K(G+H)K']^{-1}K$
           $= \operatorname{tr} H(G+H)^{-1},$

where as before K is such that KGK' = I and (4) holds. In terms of the roots
$f_i = l_i/(1+l_i)$, i = 1, ..., p, of

(15)    $|H - f(H+G)| = 0,$

the criterion is $\sum_{i=1}^{p} f_i$. In principle, the cdf, density, and moments under the
null hypothesis can be found from the density of the roots (Sec. 13.2.3)

(16)    $C\prod_{i=1}^{p} f_i^{\frac12(m-p-1)}(1-f_i)^{\frac12(n-p-1)}\prod_{i<j}(f_i - f_j),$

where

(17)    $C = \frac{\pi^{\frac12 p^2}\,\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)\,\Gamma_p(\frac12 m)\,\Gamma_p(\frac12 p)}$

for $1 > f_1 > \cdots > f_p > 0$, and 0 otherwise. If m - p and n - p are odd, the
density is a polynomial in $f_1, \ldots, f_p$. Then the density and cdf of the sum of
the roots are polynomials.


Many authors have written about the moments, Laplace transforms, densities,
and cdfs, using various approaches. Nanda (1950) derived the distribution
for p = 2, 3, 4 and m = p + 1. Pillai (1954), (1956), (1960) and Pillai and
Mijares (1959) calculated the first four moments of V and proposed approximating
the distribution by a beta distribution based on the first four moments.
Pillai and Jayachandran (1970) show how to evaluate the moment
generating function as a weighted sum of determinants whose elements are
incomplete gamma functions; they derive exact densities for some special
cases and use them for a table of significance points. Krishnaiah and Chang
(1972) express the distributions as linear combinations of inverse Laplace
transforms of the products of certain double integrals and further develop
this technique for finding the distribution. Davis (1972b) showed that the
distribution satisfies a differential equation and showed the nature of the
solution. Khatri and Pillai (1968) obtained the (nonnull) distributions in
series forms. The characteristic function (under the null hypothesis) was
given by James (1964). Pillai and Jayachandran (1967) found the nonnull
distribution for p = 2 and computed power functions. For an extensive
bibliography see Krishnaiah (1978).

We now turn to the asymptotic theory. It follows from Theorem 8.6.2 that
nV or NV has a limiting $\chi^2$-distribution with pm degrees of freedom.

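Let $v_{p,m,n}(\alpha)$ be defined by

(18)    $\Pr\{\operatorname{tr} H(H+G)^{-1} > v_{p,m,n}(\alpha)\} = \alpha.$

Then Davis (1970a), (1970b), Fujikoshi (1973), and Rothenberg (1977) have
shown that

(19)    $n v_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \frac{1}{2n}\left[-\frac{p+m+1}{pm+2}\,\chi^4_{pm}(\alpha) + (p-m+1)\,\chi^2_{pm}(\alpha)\right] + O(n^{-2}).$

Since we can write (for the likelihood ratio test)

(20)    $-n\log u_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \frac{1}{2n}(p-m+1)\,\chi^2_{pm}(\alpha) + O(n^{-2}),$

we have the comparison

(21)    $n w_{p,m,n}(\alpha) = -n\log u_{p,m,n}(\alpha) + \frac{1}{2n}\,\frac{p+m+1}{pm+2}\,\chi^4_{pm}(\alpha) + O(n^{-2}),$

(22)    $n v_{p,m,n}(\alpha) = -n\log u_{p,m,n}(\alpha) - \frac{1}{2n}\,\frac{p+m+1}{pm+2}\,\chi^4_{pm}(\alpha) + O(n^{-2}).$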
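The three second-order approximations can be compared numerically. The following is a minimal sketch, not taken from the text: the function name `approx_points` is hypothetical, the $O(n^{-2})$ remainders are simply dropped, and the $\chi^2$ significance point must be supplied by the caller (for example from a table).

```python
def approx_points(chi2_point, p, m, n):
    """Approximations to n*w, n*v and -n*log(u), dropping O(n^-2) terms.

    chi2_point is the upper significance point of chi-square with p*m
    degrees of freedom.
    """
    x = chi2_point
    a = (p + m + 1) / (p * m + 2)   # coefficient of the squared point
    b = p - m + 1                   # coefficient of the linear term
    nw = x + (a * x**2 + b * x) / (2 * n)      # Lawley-Hotelling trace
    nv = x + (-a * x**2 + b * x) / (2 * n)     # Bartlett-Nanda-Pillai trace
    minus_n_log_u = x + b * x / (2 * n)        # likelihood ratio
    return nw, nv, minus_n_log_u

# Illustration with p = 2, m = 4, n = 20; 15.507 is the tabled 5% point of
# chi-square with 8 degrees of freedom.
nw, nv, lr = approx_points(15.507, p=2, m=4, n=20)
```

To this order the likelihood ratio point lies midway between the two trace points, which is what the comparison of the three expansions expresses.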




An asymptotic expansion [Muirhead (1970), Fujikoshi (1973)] is

(23)    $\Pr\{nV \le z\} = G_{pm}(z) + \frac{pm}{4n}\left[(m-p-1)\,G_{pm}(z) + 2(p+1)\,G_{pm+2}(z) - (p+m+1)\,G_{pm+4}(z)\right] + O(n^{-2}).$

Higher-order terms are given by Muirhead and Fujikoshi.


Tables. Pillai (1960) tabulated 1% and 5% significance points of V for
p = 2(1)8 based on fitting Pearson curves (i.e., beta distributions with
adjusted ranges) to the first four moments. Mijares (1964) extended the tables
to p = 50. Table B.3 of some significance points of
$(n+m)V/m = \operatorname{tr}\,(1/m)H\{[1/(n+m)](G+H)\}^{-1}$ is from Concise Statistical Tables, and was
computed on the same basis as Pillai's. Schuurman, Krishnaiah, and
Chattopodhyay (1975) gave exact significance points of V for p = 2(1)5; a
more extensive table is in their technical report (ARL 73-0008). A comparison
of some values with those of Concise Statistical Tables (Appendix B)
shows a maximum difference of 3 in the third decimal place.


8.6.4. The Roy Maximum Root Criterion 


Any characteristic root of $HG^{-1}$ can be used as a test criterion. Roy (1953)
proposed $l_1$, the maximum characteristic root of $HG^{-1}$, on the basis of his
union-intersection principle. The test procedure is to reject the null hypothesis
if $l_1$ is greater than a certain number, or equivalently, if $f_1 = l_1/(1+l_1) = R$
is greater than a number $r_{p,m,n}(\alpha)$ which satisfies

(24)    $\Pr\{R > r_{p,m,n}(\alpha)\} = \alpha.$

The density of the roots $f_1, \ldots, f_p$ for $p \le m$ under the null hypothesis is
given in (16). The cdf of $R = f_1$, $\Pr\{f_1 \le f^*\}$, can be obtained from the joint
density by integration over the range $0 \le f_p \le \cdots \le f_1 \le f^*$. If m - p and
n - p are both odd, the density of $f_1, \ldots, f_p$ is a polynomial; then the cdf of
$f_1$ is a polynomial in $f^*$ and the density of $f_1$ is a polynomial. The only
difficulty in carrying out the integration is keeping track of the different
terms.

Roy [(1945), (1957), Appendix 9] developed a method of integration that
results in a cdf that is a linear combination of products of univariate beta
densities and beta cdfs. The cdf of $f_1$ for p = 2 is

(25)    $\Pr\{f_1 \le f\} = I_f(m-1,\, n-1)$
          $- \frac{\sqrt{\pi}\,\Gamma[\frac12(m+n-1)]}{\Gamma(\frac12 m)\,\Gamma(\frac12 n)}\, f^{\frac12(m-1)}(1-f)^{\frac12(n-1)}\, I_f[\frac12(m-1),\, \frac12(n-1)].$

This is derived in Section 13.5. Roy (1957), Chapter 8, gives the cdfs for
p = 3 and 4 also.
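As a check that is not in the original text, consider the special case p = 2, m = n = 3, reading the p = 2 cdf of $f_1$ as $I_f(m-1, n-1)$ minus the constant times $f^{\frac12(m-1)}(1-f)^{\frac12(n-1)} I_f[\frac12(m-1), \frac12(n-1)]$. Under that reading the formula reduces to a single power of f, and direct integration of the joint density of the roots (whose normalizing constant equals 6 in this case) gives the same value:

```latex
\begin{align*}
I_f(2,2) &= 3f^2 - 2f^3, \qquad
\frac{\sqrt{\pi}\,\Gamma(\tfrac52)}{\Gamma(\tfrac32)\,\Gamma(\tfrac32)} = 3, \qquad
I_f(1,1) = f, \\
\Pr\{f_1 \le f\} &= 3f^2 - 2f^3 - 3\,f\,(1-f)\,f = f^3, \\
\int_0^{f}\!\!\int_0^{f_1} 6\,(f_1 - f_2)\,df_2\,df_1 &= \int_0^{f} 3 f_1^2\,df_1 = f^3 .
\end{align*}
```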

By Theorem 8.6.2 the limiting distribution of the largest characteristic root
of $nHG^{-1}$, $NHG^{-1}$, $nH(H+G)^{-1}$, or $NH(H+G)^{-1}$ is the distribution of
the largest characteristic root of H having the distribution W(I, m). The
densities of the roots of H are given in Section 13.3. In principle, the
marginal density of the largest root can be obtained from the joint density by
integration, but in actual fact the integration is more difficult than that for
the density of the roots of $HG^{-1}$ or $H(H+G)^{-1}$.

The literature on this subject is too extensive to summarize here. Nanda
(1948) obtained the distribution for p = 2, 3, 4, and 5. Pillai (1954), (1956),
(1965), (1967) treated the distribution under the null hypothesis. Other
results were obtained by Sugiyama and Fukutomi (1966) and Sugiyama
(1967). Pillai (1967) derived an appropriate distribution as a linear combination
of incomplete beta functions. Davis (1972a) showed that the density of a
single ordered root satisfies a differential equation and (1972b) derived a
recurrence relation for it. Hayakawa (1967), Khatri and Pillai (1968), Pillai
and Sugiyama (1969), and Khatri (1972) treated the noncentral case. See
Krishnaiah (1978) for more references.


Tables. Tables of the percentage points have been calculated by Nanda
(1951) and Foster and Rees (1957) for p = 2, Foster (1957) for p = 3, Foster
(1958) for p = 4, and Pillai (1960) for p = 2(1)6 on the basis of an approximation.
[See also Pillai (1956), (1960), (1964), (1965), (1967).] Heck (1960) presented
charts of the significance points for p = 2(1)6. Table B.4 of significance
points of $n l_1/m$ is from Concise Statistical Tables, based on the
approximation by Pillai (1967).


8.6.5. Comparison of Powers 


The four tests that have been given most consideration are those based on
Wilks's U, the Lawley-Hotelling W, the Bartlett-Nanda-Pillai V, and Roy's
R. To guide in the choice of one of these four, we would like to compare
power functions. The first three have been compared by Rothenberg on the
basis of the asymptotic expansions of their distributions in the nonnull case.
Let $\nu_1^N, \ldots, \nu_p^N$ be the roots of

(26)    $|(B_1 - B_1^*)A_{11\cdot2}(B_1 - B_1^*)' - \nu\Sigma| = 0.$

The distribution of

(27)    $\operatorname{tr}\,(\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)'\Sigma^{-1}$

is the noncentral $\chi^2$-distribution with pm degrees of freedom and noncentrality
parameter $\sum_{i=1}^{p}\nu_i^N$. As $N \to \infty$, the quantity (1/n)G or (1/N)G
approaches $\Sigma$ with probability one. If we let $N \to \infty$ and $A_{11\cdot2}$ is unbounded,
the noncentrality parameter grows indefinitely and the power approaches 1.
It is more informative to consider a sequence of alternatives such that the
powers of the different tests are different. Suppose $B_1 = B_1^N$ is a sequence of
matrices such that as $N \to \infty$, $(B_1^N - B_1^*)A_{11\cdot2}(B_1^N - B_1^*)'$ approaches a limit
and hence $\nu_1^N, \ldots, \nu_p^N$ approach some limiting values $\nu_1, \ldots, \nu_p$, respectively.
Then the limiting distribution of $N\operatorname{tr} HG^{-1}$, $n\operatorname{tr} HG^{-1}$, $N\operatorname{tr} H(H+G)^{-1}$,
and $n\operatorname{tr} H(H+G)^{-1}$ is the noncentral $\chi^2$-distribution with pm degrees of
freedom and noncentrality parameter $\sum_{i=1}^{p}\nu_i$. Similarly for $-N\log U$ and
$-n\log U$.


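Rothenberg (1977) has shown under the above conditions that

(28)    $\Pr\{U \le u_{p,m,n}(\alpha)\} = 1 - G_{pm}\Bigl[\chi^2_{pm}(\alpha) \Bigm| \sum_{i=1}^{p}\nu_i\Bigr]$
           $- \frac{1}{2n}\Bigl\{(p+m+1)\sum_{i=1}^{p}\nu_i\, g_{pm+4}[\chi^2_{pm}(\alpha)] + \sum_{i=1}^{p}\nu_i^2\, g_{pm+6}[\chi^2_{pm}(\alpha)]\Bigr\} + o\Bigl(\frac{1}{n}\Bigr),$

(29)    $\Pr\{\operatorname{tr} HG^{-1} \ge w_{p,m,n}(\alpha)\} = 1 - G_{pm}\Bigl[\chi^2_{pm}(\alpha) \Bigm| \sum_{i=1}^{p}\nu_i\Bigr]$
           $- \frac{1}{2n}\Bigl\{(p+m+1)\sum_{i=1}^{p}\nu_i\, g_{pm+4}[\chi^2_{pm}(\alpha)] + \sum_{i=1}^{p}\nu_i^2\, g_{pm+6}[\chi^2_{pm}(\alpha)]$
           $- \Bigl[\sum_{i=1}^{p}\nu_i^2 - \frac{p+m+1}{pm+2}\Bigl(\sum_{i=1}^{p}\nu_i\Bigr)^2\Bigr] g_{pm+8}[\chi^2_{pm}(\alpha)]\Bigr\} + o\Bigl(\frac{1}{n}\Bigr),$

(30)    $\Pr\{\operatorname{tr} H(H+G)^{-1} \ge v_{p,m,n}(\alpha)\} = 1 - G_{pm}\Bigl[\chi^2_{pm}(\alpha) \Bigm| \sum_{i=1}^{p}\nu_i\Bigr]$
           $- \frac{1}{2n}\Bigl\{(p+m+1)\sum_{i=1}^{p}\nu_i\, g_{pm+4}[\chi^2_{pm}(\alpha)] + \sum_{i=1}^{p}\nu_i^2\, g_{pm+6}[\chi^2_{pm}(\alpha)]$
           $+ \Bigl[\sum_{i=1}^{p}\nu_i^2 - \frac{p+m+1}{pm+2}\Bigl(\sum_{i=1}^{p}\nu_i\Bigr)^2\Bigr] g_{pm+8}[\chi^2_{pm}(\alpha)]\Bigr\} + o\Bigl(\frac{1}{n}\Bigr),$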


where $G_f(x \mid y)$ is the noncentral $\chi^2$-distribution with f degrees of freedom
and noncentrality parameter y, and $g_f(x)$ is the (central) $\chi^2$-density with f
degrees of freedom. The leading terms are the noncentral $\chi^2$-distribution;
the power functions of the three tests agree to this order. The power
functions of the two trace tests differ from that of the likelihood ratio test by
$\pm g_{pm+8}[\chi^2_{pm}(\alpha)]/(2n)$ times

(31)    $\sum_{i=1}^{p}\nu_i^2 - \frac{p+m+1}{pm+2}\Bigl(\sum_{i=1}^{p}\nu_i\Bigr)^2 = \sum_{i=1}^{p}(\nu_i - \bar\nu)^2 - \frac{p(p-1)(p+2)}{pm+2}\,\bar\nu^2,$

where $\bar\nu = \sum_{i=1}^{p}\nu_i/p$. This is positive if

(32)    $\frac{\sigma_\nu}{\bar\nu} > \sqrt{\frac{(p-1)(p+2)}{pm+2}},$

where $\sigma_\nu^2 = \sum_{i=1}^{p}(\nu_i - \bar\nu)^2/p$ is the (population) variance of $\nu_1, \ldots, \nu_p$; the
left-hand side of (32) is the coefficient of variation. If the $\nu_i$'s are relatively
variable in the sense that (32) holds, the power of the Lawley-Hotelling trace
test is greater than that of the likelihood ratio test, which in turn is greater
than that of the Bartlett-Nanda-Pillai trace test (to order 1/n); if the
inequality (32) is reversed, the ordering of power is reversed.
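The sign condition can be evaluated for any set of limiting roots. The sketch below is illustrative only: the function name `trace_test_ordering` and the example roots are hypothetical. It computes the quantity $\sum\nu_i^2 - \frac{p+m+1}{pm+2}(\sum\nu_i)^2$ in both of its algebraic forms and compares the coefficient of variation $\sigma_\nu/\bar\nu$ with $\sqrt{(p-1)(p+2)/(pm+2)}$.

```python
import math

def trace_test_ordering(nus, m):
    """Return (d, d_alt, cv, threshold) for limiting roots nus and given m."""
    p = len(nus)
    total = sum(nus)
    nu_bar = total / p
    # First form: sum of squares minus a multiple of the squared sum.
    d = sum(v * v for v in nus) - (p + m + 1) / (p * m + 2) * total**2
    # Second form: centered sum of squares minus a multiple of the squared mean.
    centered = sum((v - nu_bar) ** 2 for v in nus)
    d_alt = centered - p * (p - 1) * (p + 2) / (p * m + 2) * nu_bar**2
    cv = math.sqrt(centered / p) / nu_bar          # coefficient of variation
    threshold = math.sqrt((p - 1) * (p + 2) / (p * m + 2))
    return d, d_alt, cv, threshold

# Hypothetical roots: one large root and one zero root, with m = 4.
d, d_alt, cv, threshold = trace_test_ordering([4.0, 0.0], m=4)
# Equal roots give a negative value, reversing the ordering.
d_equal = trace_test_ordering([2.0, 2.0], m=4)[0]
```

A positive value favors the Lawley-Hotelling trace test and a negative value favors the Bartlett-Nanda-Pillai trace test, to order 1/n.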

The differences between the powers decrease as n increases for fixed
$\nu_1, \ldots, \nu_p$. (However, this comparison is not very meaningful, because
increasing n decreases $B_1^N - B_1^*$ and increases Z'Z.)

A number of numerical comparisons have been made. Schatzoff (1966b)
and Olson (1974) have used Monte Carlo methods; Mikhail (1965), Pillai and
Jayachandran (1967), and Lee (1971a) have used asymptotic expansions of
distributions. All of these results agree with Rothenberg's. Among these
three procedures, the Bartlett-Nanda-Pillai trace test is to be preferred if
the roots are roughly equal in the alternative, and the Lawley-Hotelling
trace is more powerful when the roots are substantially unequal. Wilks's
likelihood ratio test seems to come in second best; in a sense it is maximin.

As noted in Section 8.6.4, the Roy largest root has a limiting distribution
which is not a $\chi^2$-distribution under the null hypothesis and is not a
noncentral $\chi^2$-distribution under a sequence of alternative hypotheses. Hence
the comparison of Rothenberg cannot be extended to this case. In fact, the
distributions in the nonnull case are difficult to evaluate. However, the
Monte Carlo results of Schatzoff (1966b) and Olson (1974) are clear-cut.
The maximum root test has greatest power if the alternative is one-dimensional,
that is, if $\nu_2 = \cdots = \nu_p = 0$. On the other hand, if the alternative is not
one-dimensional, then the maximum root test is inferior.

These test procedures tend to be robust. Under the null hypothesis the
limiting distribution of $\hat B_1 - B_1^*$ suitably normalized is normal with mean 0
and covariances the same as if X were normal, as long as its distribution
satisfies some condition such as bounded fourth-order moments. Then
$\hat\Sigma = (1/N)G$ converges with probability one. The limiting distribution of each
criterion suitably normalized is the same as if X were normal. Olson (1974)
studied the robustness under departures from covariance homogeneity as
well as departures from normality. His conclusion was that the two trace tests
and the likelihood ratio test were rather robust, and the maximum root test
least robust. See also Pillai and Hsu (1979).

Berndt and Savin (1977) have noted that

(33)    $\operatorname{tr} H(H+G)^{-1} \le \log U^{-1} \le \operatorname{tr} HG^{-1}.$

(See Problem 8.19.) If the $\chi^2$ significance point is used, then a larger
criterion may lead to rejection while a smaller one may not.
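The ordering of the three criteria can be seen root by root, since $l/(1+l) \le \log(1+l) \le l$ for $l \ge 0$. A minimal sketch follows; it assumes the standard relation $U = \prod 1/(1+l_i)$ between Wilks's criterion and the roots (taken from earlier in the chapter, not from this passage), and the function name and roots are hypothetical.

```python
import math

def three_criteria(roots):
    """Return (V, -log U, W) computed from the roots l_i."""
    v = sum(l / (1.0 + l) for l in roots)           # tr H(H+G)^{-1}
    neg_log_u = sum(math.log1p(l) for l in roots)   # -log U = sum log(1 + l_i)
    w = sum(roots)                                  # tr HG^{-1}
    return v, neg_log_u, w

# Hypothetical roots, for illustration only.
v, neg_log_u, w = three_criteria([1.0, 0.25])
```

After multiplication by n (or N), the three values are referred to the same $\chi^2$ point, which is why the largest of them can reject when the smallest does not.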


8.7. TESTS OF HYPOTHESES ABOUT MATRICES OF REGRESSION
COEFFICIENTS AND CONFIDENCE REGIONS


8.7.1. Testing Hypotheses 


Suppose we are given a set of vector observations $x_1, \ldots, x_N$ with accompanying
fixed vectors $z_1, \ldots, z_N$, where $x_\alpha$ is an observation from $N(Bz_\alpha, \Sigma)$. We
let $B = (B_1\ B_2)$ and $z_\alpha' = (z_\alpha^{(1)\prime}\ z_\alpha^{(2)\prime})$, where $B_1$ and $z_\alpha^{(1)\prime}$ have $q_1$ $(= q - q_2)$
columns. The null hypothesis is

(1)    $H: B_1 = B_1^*,$

where $B_1^*$ is a specified matrix. Suppose the desired significance level is $\alpha$. A
test procedure is to compute

(2)    $U = \frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\omega|}$

and compare this number with $u_{p,q_1,n}(\alpha)$, the $\alpha$ significance point of the
$U_{p,q_1,n}$-distribution. For p = 2, ..., 10 and even m, Table 1 in Appendix B can
be used. For m = 2, ..., 10 and even p the same table can be used with m
replaced by p and p replaced by m. (M as given in the table remains
unchanged.) For p and m both odd, interpolation between even values of
either p or m will give sufficient accuracy for most purposes. For reasonably
large n, the asymptotic theory can be used. An equivalent procedure is to
calculate $\Pr\{U_{p,m,n} \le U\}$; if this is less than $\alpha$, the null hypothesis is rejected.
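Alternatively one can use the Lawley-Hotelling trace criterion

(3)    $W = \operatorname{tr}\,(N\hat\Sigma_\omega - N\hat\Sigma_\Omega)(N\hat\Sigma_\Omega)^{-1}$
          $= \operatorname{tr}\,(\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)'(N\hat\Sigma_\Omega)^{-1},$

the Pillai trace criterion

(4)    $V = \operatorname{tr}\,(N\hat\Sigma_\omega - N\hat\Sigma_\Omega)(N\hat\Sigma_\omega)^{-1}$
          $= \operatorname{tr}\,(\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)'(N\hat\Sigma_\omega)^{-1},$

or the Roy maximum root criterion R, where R is the maximum root of

(5)    $|N\hat\Sigma_\omega - N\hat\Sigma_\Omega - rN\hat\Sigma_\omega|$
          $= |(\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)' - rN\hat\Sigma_\omega| = 0.$

These criteria can be referred to the appropriate tables in Appendix B.

We outline an approach to computing the criterion. If we let
$y_\alpha = x_\alpha - B_1^* z_\alpha^{(1)}$, then $y_\alpha$ can be considered as an observation from $N(\Delta z_\alpha, \Sigma)$, where
$\Delta = (\Delta_1\ \Delta_2) = (B_1 - B_1^*\ \ B_2)$. Then the null hypothesis is $H: \Delta_1 = 0$, and

(6)    $\sum_\alpha y_\alpha y_\alpha' = \sum_\alpha x_\alpha x_\alpha' - B_1^* C_1' - C_1 B_1^{*\prime} + B_1^* A_{11} B_1^{*\prime},$

(7)    $\sum_\alpha y_\alpha z_\alpha' = C - B_1^*(A_{11}\ A_{12}).$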


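Thus the problem of testing the hypothesis $B_1 = B_1^*$ is equivalent to testing
the hypothesis $\Delta_1 = 0$, where $\mathcal{E}y_\alpha = \Delta z_\alpha$. Hence let us suppose the problem
is testing the hypothesis $B_1 = 0$. Then $N\hat\Sigma_\omega = \sum x_\alpha x_\alpha' - \hat B_{2\omega}A_{22}\hat B_{2\omega}'$ and
$N\hat\Sigma_\Omega = \sum x_\alpha x_\alpha' - \hat B_\Omega A\hat B_\Omega'$. We have discussed in Section 8.2.2 the
computation of $\hat B_\Omega A\hat B_\Omega'$ and hence $N\hat\Sigma_\Omega$. Then $\hat B_{2\omega}A_{22}\hat B_{2\omega}'$ can be computed in a
similar manner. If the method is laid out as

(8)    $\begin{pmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{pmatrix}\begin{pmatrix} \hat B_{2\Omega}' \\ \hat B_{1\Omega}' \end{pmatrix} = \begin{pmatrix} C_2' \\ C_1' \end{pmatrix},$

the first $q_2$ rows and columns of $A^*$ and of $A^{**}$ are the same as the result of
applying the forward solution to the left-hand side of

(9)    $A_{22}\hat B_{2\omega}' = C_2',$

and the first $q_2$ rows of $C^*$ and $C^{**}$ are the same as the result of applying
the forward solution to the right-hand side of (9). Thus $\hat B_{2\omega}A_{22}\hat B_{2\omega}' = C_2^{*\prime}C_2^{**}$,
where $C^{*\prime} = (C_2^{*\prime}\ C_1^{*\prime})$ and $C^{**\prime} = (C_2^{**\prime}\ C_1^{**\prime})$.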

The method implies a method for computing a determinant. In Section
A.5 of the Appendix it is shown that the result of the forward solution is
FA = A*. Thus $|F| \cdot |A| = |A^*|$. Since the determinant of a triangular matrix
is the product of its diagonal elements, $|F| = 1$ and $|A| = |A^*| = \prod_i a^*_{ii}$.
This result holds for any positive definite matrix in place of A (with a suitable
modification of F) and hence can be used to compute $|N\hat\Sigma_\Omega|$ and $|N\hat\Sigma_\omega|$.
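The determinant computation can be sketched as plain forward elimination. This is an illustrative implementation under the assumption that the input is positive definite, so that no pivoting is needed; the function name `det_by_forward_solution` is hypothetical and the routine is not the tabular layout of Section A.5.

```python
def det_by_forward_solution(a):
    """Determinant of a positive definite matrix via forward elimination.

    The matrix is reduced to upper triangular form by subtracting multiples
    of earlier rows (a unit lower triangular transformation), so the
    determinant is the product of the resulting diagonal elements.
    """
    n = len(a)
    u = [row[:] for row in a]          # work on a copy
    for k in range(n):
        for i in range(k + 1, n):
            factor = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= factor * u[k][j]
    det = 1.0
    for k in range(n):
        det *= u[k][k]
    return det

example = det_by_forward_solution([[4.0, 2.0], [2.0, 3.0]])
```

Applied to $N\hat\Sigma_\Omega$ and $N\hat\Sigma_\omega$, the ratio of the two results gives U.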


8.7.2. Confidence Regions Based on U 


We have considered tests of hypotheses $B_1 = B_1^*$, where $B_1^*$ is specified. In
the usual way we can deduce from the family of tests a confidence region for
$B_1$. From the theory given before, we know that the probability is $1 - \alpha$ of
drawing a sample so that

(10)    $\frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\Omega + (\hat B_{1\Omega} - B_1)A_{11\cdot2}(\hat B_{1\Omega} - B_1)'|} \ge u_{p,q_1,n}(\alpha).$

Thus if we make the confidence-region statement that $B_1$ satisfies

(11)    $\frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\Omega + (\hat B_{1\Omega} - B_1^*)A_{11\cdot2}(\hat B_{1\Omega} - B_1^*)'|} \ge u_{p,q_1,n}(\alpha),$

where (11) is interpreted as an inequality on $B_1 = B_1^*$, then the probability is
$1 - \alpha$ of drawing a sample such that the statement is true.


Theorem 8.7.1. The region (11) in the $B_1^*$-space is a confidence region for
$B_1$ with confidence coefficient $1 - \alpha$.

Usually the set of $B_1^*$ satisfying (11) is difficult to visualize. However, the
inequality can be used to determine whether trial matrices are included in
the region.


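8.7.3. Simultaneous Confidence Intervals Based on the
Lawley-Hotelling Trace


Each test procedure implies a set of confidence regions. The Lawley-Hotelling
trace criterion can be used to develop simultaneous confidence intervals
for linear combinations of elements of $B_1$. A confidence region with confidence
coefficient $1 - \alpha$ is

(12)    $\operatorname{tr}\,(\hat B_{1\Omega} - B_1)A_{11\cdot2}(\hat B_{1\Omega} - B_1)'(N\hat\Sigma_\Omega)^{-1} \le w_{p,m,n}(\alpha).$

To derive the confidence bounds we generalize Lemma 5.3.2.

Lemma 8.7.1. For positive definite matrices A and G,

(13)    $|\operatorname{tr}\Phi'Y| \le \sqrt{\operatorname{tr} A^{-1}\Phi'G\Phi}\ \sqrt{\operatorname{tr} AY'G^{-1}Y}.$

Proof. Let $b = \operatorname{tr}\Phi'Y/\operatorname{tr} A^{-1}\Phi'G\Phi$. Then

(14)    $0 \le \operatorname{tr} A(Y - bG\Phi A^{-1})'G^{-1}(Y - bG\Phi A^{-1})$
           $= \operatorname{tr} AY'G^{-1}Y - b\operatorname{tr}\Phi'Y - b\operatorname{tr} Y'\Phi + b^2\operatorname{tr}\Phi'G\Phi A^{-1}$
           $= \operatorname{tr} AY'G^{-1}Y - \frac{(\operatorname{tr}\Phi'Y)^2}{\operatorname{tr} A^{-1}\Phi'G\Phi},$

which yields (13). ∎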
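Now (12) and (13) imply that

(15)    $|\operatorname{tr}\Phi'\hat B_{1\Omega} - \operatorname{tr}\Phi'B_1| = |\operatorname{tr}\Phi'(\hat B_{1\Omega} - B_1)|$
           $\le \sqrt{\operatorname{tr} A_{11\cdot2}^{-1}\Phi'N\hat\Sigma_\Omega\Phi}\ \sqrt{\operatorname{tr} A_{11\cdot2}(\hat B_{1\Omega} - B_1)'(N\hat\Sigma_\Omega)^{-1}(\hat B_{1\Omega} - B_1)}$
           $\le \sqrt{N\operatorname{tr} A_{11\cdot2}^{-1}\Phi'\hat\Sigma_\Omega\Phi}\ \sqrt{w_{p,m,n}(\alpha)}$

holds for all p × m matrices $\Phi$. We assert that

(16)    $\operatorname{tr}\Phi'\hat B_{1\Omega} - \sqrt{N\operatorname{tr} A_{11\cdot2}^{-1}\Phi'\hat\Sigma_\Omega\Phi}\ \sqrt{w_{p,m,n}(\alpha)} \le \operatorname{tr}\Phi'B_1$
           $\le \operatorname{tr}\Phi'\hat B_{1\Omega} + \sqrt{N\operatorname{tr} A_{11\cdot2}^{-1}\Phi'\hat\Sigma_\Omega\Phi}\ \sqrt{w_{p,m,n}(\alpha)}$

holds for all $\Phi$ with confidence $1 - \alpha$.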
The confidence region (12) can be explored by use of (16) for various $\Phi$. If
$\phi_{ik} = 1$ for some pair (I, K) and 0 for other elements, then (16) gives an
interval for $\beta_{IK}$. If $\phi_{ik} = 1$ for a pair (I, K), -1 for (I, L), and 0 otherwise,
the interval pertains to $\beta_{IK} - \beta_{IL}$, the difference of coefficients of two
independent variables. If $\phi_{ik} = 1$ for a pair (I, K), -1 for (J, K), and 0
otherwise, one obtains an interval for $\beta_{IK} - \beta_{JK}$, the difference of coefficients
for two dependent variables.


8.7.4. Simultaneous Confidence Intervals Based on the Roy Maximum 
Root Criterion 


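A confidence region with confidence $1 - \alpha$ based on the maximum root
criterion is

(17)    $\operatorname{ch}_1\bigl[(\hat B_{1\Omega} - B_1)A_{11\cdot2}(\hat B_{1\Omega} - B_1)'(N\hat\Sigma_\Omega)^{-1}\bigr] \le r_{p,m,n}(\alpha),$

where $\operatorname{ch}_1(C)$ denotes the largest characteristic root of C. We can derive
simultaneous confidence bounds from (17). From Lemma 5.3.2, we find for
any vectors a and b (writing $G = N\hat\Sigma_\Omega$)

(18)    $[a'(\hat B_{1\Omega} - B_1)b]^2 = \{[(\hat B_{1\Omega} - B_1)'a]'b\}^2$
           $\le [(\hat B_{1\Omega} - B_1)'a]'A_{11\cdot2}[(\hat B_{1\Omega} - B_1)'a] \cdot b'A_{11\cdot2}^{-1}b$
           $= \frac{a'(\hat B_{1\Omega} - B_1)A_{11\cdot2}(\hat B_{1\Omega} - B_1)'a}{a'Ga} \cdot a'Ga \cdot b'A_{11\cdot2}^{-1}b$
           $\le \operatorname{ch}_1\bigl[(\hat B_{1\Omega} - B_1)A_{11\cdot2}(\hat B_{1\Omega} - B_1)'G^{-1}\bigr] \cdot a'Ga \cdot b'A_{11\cdot2}^{-1}b$
           $\le r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot2}^{-1}b$

with probability $1 - \alpha$; the second inequality follows from Theorem A.2.4 of
the Appendix. Then a set of confidence intervals on all linear combinations
$a'B_1b$ holding with confidence $1 - \alpha$ is

(19)    $a'\hat B_{1\Omega}b - \sqrt{r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot2}^{-1}b} \le a'B_1b$
           $\le a'\hat B_{1\Omega}b + \sqrt{r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot2}^{-1}b}.$

The linear combinations are $a'B_1b = \sum_{i=1}^{p}\sum_{h=1}^{m} a_i\beta_{ih}b_h$. If $a_1 = 1$, $a_i = 0$,
$i \ne 1$, and $b_1 = 1$, $b_h = 0$, $h \ne 1$, the linear combination is simply $\beta_{11}$. If
$a_1 = 1$, $a_i = 0$, $i \ne 1$, and $b_1 = 1$, $b_2 = -1$, $b_h = 0$, $h \ne 1, 2$, the linear
combination is $\beta_{11} - \beta_{12}$.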
We can compare these intervals with (16) for $\Phi = ab'$, which is of rank 1.
The term subtracted from and added to $\operatorname{tr}\Phi'\hat B_{1\Omega} = a'\hat B_{1\Omega}b$ is the square root
of

(20)    $w_{p,m,n}(\alpha) \cdot \operatorname{tr} A_{11\cdot2}^{-1}ba'Gab' = w_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot2}^{-1}b.$

This is greater than the term subtracted and added to $a'\hat B_{1\Omega}b$ in (19) because
$w_{p,m,n}(\alpha)$, pertaining to the sum of the roots, is greater than $r_{p,m,n}(\alpha)$,
relating to one root. The bounds (16) hold for all p × m matrices $\Phi$, while
(19) holds only for matrices ab' of rank 1.

Mudholkar (1966) gives a very general method of constructing simultaneous
confidence intervals based on symmetric gauge functions. Gabriel (1969)
relates confidence bounds to simultaneous test procedures. Wijsman (1979)
showed that under certain conditions the confidence sets based on the
maximum root are smallest. [See also Wijsman (1980).]


8.8. TESTING EQUALITY OF MEANS OF SEVERAL NORMAL 
DISTRIBUTIONS WITH COMMON COVARIANCE MATRIX 


In univariate analysis it is well known that many hypotheses can be put in the 
form of hypotheses concerning regression coefficients. The same is true for 
the corresponding multivariate cases. As an example we consider testing the 
hypothesis that the means of, say, q normal distributions with a common 
covariance matrix are equal. 

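Let $y_\alpha^{(i)}$ be an observation from $N(\mu^{(i)}, \Sigma)$, $\alpha = 1, \ldots, N_i$, $i = 1, \ldots, q$. The
null hypothesis is

(1)    $H: \mu^{(1)} = \cdots = \mu^{(q)}.$

To put the problem in the form considered earlier in this chapter, let

(2)    $X = (x_1\ x_2\ \cdots\ x_{N_1}\ x_{N_1+1}\ \cdots\ x_N) = (y_1^{(1)}\ y_2^{(1)}\ \cdots\ y_{N_1}^{(1)}\ y_1^{(2)}\ \cdots\ y_{N_q}^{(q)})$

with $N = N_1 + \cdots + N_q$. Let

(3)    $Z = (z_1\ z_2\ \cdots\ z_{N_1}\ z_{N_1+1}\ \cdots\ z_N)
          = \begin{pmatrix}
          1 & 1 & \cdots & 1 & 0 & \cdots & 0 \\
          0 & 0 & \cdots & 0 & 1 & \cdots & 0 \\
          \vdots & \vdots & & \vdots & \vdots & & \vdots \\
          0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
          1 & 1 & \cdots & 1 & 1 & \cdots & 1
          \end{pmatrix};$

that is, $z_{i\alpha} = 1$ if $N_1 + \cdots + N_{i-1} < \alpha \le N_1 + \cdots + N_i$ and $z_{i\alpha} = 0$ otherwise,
for $i = 1, \ldots, q-1$, and $z_{q\alpha} = 1$ (all $\alpha$). Let $B = (B_1\ B_2)$, where

(4)    $B_1 = (\mu^{(1)} - \mu^{(q)}\ \cdots\ \mu^{(q-1)} - \mu^{(q)}), \qquad B_2 = \mu^{(q)}.$

Then $x_\alpha$ is an observation from $N(Bz_\alpha, \Sigma)$, and the null hypothesis is $B_1 = 0$.
Thus we can use the above theory for finding the criterion for testing the
hypothesis.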


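We have

(5)    $A = \sum_\alpha z_\alpha z_\alpha' = \begin{pmatrix}
          N_1 & 0 & \cdots & 0 & N_1 \\
          0 & N_2 & \cdots & 0 & N_2 \\
          \vdots & \vdots & & \vdots & \vdots \\
          0 & 0 & \cdots & N_{q-1} & N_{q-1} \\
          N_1 & N_2 & \cdots & N_{q-1} & N
          \end{pmatrix},$

(6)    $C = \sum_\alpha x_\alpha z_\alpha' = \Bigl(\sum_\alpha y_\alpha^{(1)}\ \ \sum_\alpha y_\alpha^{(2)}\ \cdots\ \sum_\alpha y_\alpha^{(q-1)}\ \ \sum_{i,\alpha} y_\alpha^{(i)}\Bigr).$

Here $A_{22} = N$ and $C_2 = \sum_{i,\alpha} y_\alpha^{(i)}$. Thus $\hat B_{2\omega} = \sum_{i,\alpha} y_\alpha^{(i)} \cdot (1/N) = \bar y$, say, and

(7)    $N\hat\Sigma_\omega = \sum_\alpha x_\alpha x_\alpha' - \bar y N\bar y'$
          $= \sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - N\bar y\bar y'$
          $= \sum_{i,\alpha}(y_\alpha^{(i)} - \bar y)(y_\alpha^{(i)} - \bar y)'.$

For $\hat\Sigma_\Omega$, we use the formula $N\hat\Sigma_\Omega = \sum x_\alpha x_\alpha' - \hat B_\Omega A\hat B_\Omega' = \sum x_\alpha x_\alpha' - CA^{-1}C'$.
Let

(8)    $D = \begin{pmatrix}
          1 & 0 & \cdots & 0 & 0 \\
          0 & 1 & \cdots & 0 & 0 \\
          \vdots & \vdots & & \vdots & \vdots \\
          0 & 0 & \cdots & 1 & 0 \\
          -1 & -1 & \cdots & -1 & 1
          \end{pmatrix};$

then

(9)    $D^{-1} = \begin{pmatrix}
          1 & 0 & \cdots & 0 & 0 \\
          0 & 1 & \cdots & 0 & 0 \\
          \vdots & \vdots & & \vdots & \vdots \\
          0 & 0 & \cdots & 1 & 0 \\
          1 & 1 & \cdots & 1 & 1
          \end{pmatrix}.$

Thus

(10)    $CA^{-1}C' = CD'(D')^{-1}A^{-1}D^{-1}DC'$
           $= CD'(DAD')^{-1}DC'$
           $= \Bigl(\sum_\alpha y_\alpha^{(1)}\ \cdots\ \sum_\alpha y_\alpha^{(q)}\Bigr)
             \begin{pmatrix} N_1 & 0 & \cdots & 0 \\ 0 & N_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & N_q \end{pmatrix}^{-1}
             \begin{pmatrix} \sum_\alpha y_\alpha^{(1)\prime} \\ \vdots \\ \sum_\alpha y_\alpha^{(q)\prime} \end{pmatrix}$
           $= \sum_i N_i\,\bar y^{(i)}\bar y^{(i)\prime},$

where $\bar y^{(i)} = (1/N_i)\sum_\alpha y_\alpha^{(i)}$. Thus

(11)    $N\hat\Sigma_\Omega = \sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - \sum_i N_i\,\bar y^{(i)}\bar y^{(i)\prime}$
           $= \sum_{i,\alpha}(y_\alpha^{(i)} - \bar y^{(i)})(y_\alpha^{(i)} - \bar y^{(i)})'.$

It will be seen that $\hat\Sigma_\omega$ is the estimator of $\Sigma$ when $\mu^{(1)} = \cdots = \mu^{(q)}$ and $\hat\Sigma_\Omega$
is the weighted average of the estimators of $\Sigma$ based on the separate
samples.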

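When the null hypothesis is true, $|N\hat\Sigma_\Omega|/|N\hat\Sigma_\omega|$ is distributed as $U_{p,q-1,n}$,
where n = N - q. Therefore, the rejection region at the $\alpha$ significance level is

(12)    $\frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\omega|} < u_{p,q-1,n}(\alpha).$

The left-hand side of (12) is (11) of Section 8.3, and

(13)    $N\hat\Sigma_\omega - N\hat\Sigma_\Omega = \sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - N\bar y\bar y' - \Bigl(\sum_{i,\alpha} y_\alpha^{(i)}y_\alpha^{(i)\prime} - \sum_i N_i\,\bar y^{(i)}\bar y^{(i)\prime}\Bigr)$
           $= \sum_i N_i(\bar y^{(i)} - \bar y)(\bar y^{(i)} - \bar y)' = H,$

as implied by (4) and (5) of Section 8.4. Here H has the distribution
$W(\Sigma, q-1)$. It will be seen that when p = 1, this test reduces to the usual
F-test

(14)    $\frac{\sum_i N_i(\bar y^{(i)} - \bar y)^2}{\sum_{i,\alpha}(y_\alpha^{(i)} - \bar y^{(i)})^2} \cdot \frac{n}{q-1} > F_{q-1,n}(\alpha).$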
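The within-samples matrix $N\hat\Sigma_\Omega$, the total matrix $N\hat\Sigma_\omega$, and their difference H can be formed directly from the samples. The following is a minimal sketch with hypothetical helper names and a small made-up data set (not the skull data used in the example that follows).

```python
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def ssp_about(vectors, center):
    """Sum of (y - center)(y - center)' over the given vectors."""
    p = len(center)
    total = [[0.0] * p for _ in range(p)]
    for y in vectors:
        d = [yi - ci for yi, ci in zip(y, center)]
        total = add(total, outer(d, d))
    return total

def one_way_ssp(groups):
    """Return (within, total, between) sums of squares and products."""
    pooled = [y for g in groups for y in g]
    grand = mean(pooled)
    p = len(grand)
    within = [[0.0] * p for _ in range(p)]
    between = [[0.0] * p for _ in range(p)]
    for g in groups:
        gbar = mean(g)
        within = add(within, ssp_about(g, gbar))        # deviations from sample means
        d = [a - b for a, b in zip(gbar, grand)]
        between = add(between, [[len(g) * x for x in row] for row in outer(d, d)])
    total = ssp_about(pooled, grand)                    # deviations from the grand mean
    return within, total, between

# Made-up data: q = 2 samples, p = 2 components.
groups = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0), (7.0, 9.0), (9.0, 9.0)]]
within, total, between = one_way_ssp(groups)
```

The returned matrices satisfy total = within + between, which is the decomposition used to identify H.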


We give an example of the analysis. The data are taken from Barnard's
study of Egyptian skulls (1935). The 4 (= q) populations are Late Predynastic
(i = 1), Sixth to Twelfth (i = 2), Twelfth to Thirteenth (i = 3), and Ptolemaic
Dynasties (i = 4). The 4 (= p) measurements (i.e., components of $y_\alpha^{(i)}$) are
maximum breadth, basialveolar length, nasal height, and basibregmatic height.
The numbers of observations are $N_1 = 91$, $N_2 = 162$, $N_3 = 70$, $N_4 = 75$. The
data are summarized as

(15)    $(\bar y^{(1)}\ \bar y^{(2)}\ \bar y^{(3)}\ \bar y^{(4)}) = \begin{pmatrix}
           133.582418 & 134.265432 & 134.371429 & 135.306667 \\
            98.307692 &  96.462963 &  95.857143 &  95.040000 \\
            50.835165 &  51.148148 &  50.100000 &  52.093333 \\
           133.000000 & 134.882716 & 133.642857 & 131.466667
           \end{pmatrix},$

(16)    $N\hat\Sigma_\Omega = \begin{pmatrix}
           9661.997470 &  445.573301 & 1130.623900 & 2148.584210 \\
            445.573301 & 9073.115027 & 1239.211990 & 2255.812722 \\
           1130.623900 & 1239.211990 & 3938.320351 & 1271.054662 \\
           2148.584210 & 2255.812722 & 1271.054662 & 8741.508829
           \end{pmatrix}.$

From these data we find

(17)    $N\hat\Sigma_\omega = \begin{pmatrix}
           9785.178098 &  214.197666 & 1217.929248 & 2019.820216 \\
            214.197666 & 9559.460890 & 1131.716372 & 2381.126040 \\
           1217.929248 & 1131.716372 & 4088.731856 & 1133.473898 \\
           2019.820216 & 2381.126040 & 1133.473898 & 9382.242720
           \end{pmatrix}.$
We shall use the likelihood ratio test. The ratio of determinants is

(18)    $U = \frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\omega|} = \frac{2.4269054 \times 10^{15}}{2.9544475 \times 10^{15}} = 0.8214344.$

Here N = 398, n = 394, p = 4, and q = 4. Thus k = 393. Since n is very large,
we may assume $-k\log U_{4,3,394}$ is distributed as $\chi^2$ with 12 degrees of
freedom (when the null hypothesis is true). Here $-k\log U = 77.30$. Since the
1% point of the $\chi^2_{12}$-distribution is 26.2, the hypothesis of
$\mu^{(1)} = \mu^{(2)} = \mu^{(3)} = \mu^{(4)}$ is rejected.†
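The arithmetic of this example can be retraced in a few lines. The sketch is illustrative: it takes the two determinant mantissas quoted in (18) (their common power of ten cancels in the ratio), assumes the multiplier $k = n - \frac12(p - m + 1)$ with m = q - 1, which reproduces the stated k = 393, and uses the tabled 1% point 26.2 from the text.

```python
import math

# Ratio of the two determinants quoted in (18); the power of ten cancels.
U = 2.4269054 / 2.9544475
N, p, q = 398, 4, 4
n = N - q                       # 394
m = q - 1                       # 3
k = n - (p - m + 1) / 2         # 393, assumed form of the multiplier
statistic = -k * math.log(U)    # close to the value 77.30 given in the text
reject = statistic > 26.2       # 26.2: tabled 1% point, chi-square with 12 d.f.
```

The statistic exceeds the 1% point by a wide margin, in agreement with the conclusion stated above.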


8.9. MULTIVARIATE ANALYSIS OF VARIANCE 


The univariate analysis of variance has a direct generalization for vector
variables leading to an analysis of vector sums of squares (i.e., sums such as
$\sum x_\alpha x_\alpha'$). In fact, in the preceding section this generalization was considered
for an analysis of variance problem involving a single classification.

As another example consider a two-way layout. Suppose that we are
interested in the question whether the column effects are zero. We shall
review the analysis for a scalar variable and then show the analysis for a
vector variable. Let $Y_{ij}$, i = 1, ..., r, j = 1, ..., c, be a set of rc random
variables. We assume that

(1)    $\mathcal{E}Y_{ij} = \mu + \lambda_i + \nu_j$,    i = 1, ..., r,  j = 1, ..., c,

with the restrictions

(2)    $\sum_{i=1}^{r}\lambda_i = \sum_{j=1}^{c}\nu_j = 0,$

that the variance of $Y_{ij}$ is $\sigma^2$, and that the $Y_{ij}$ are independently normally
distributed. To test that column effects are zero is to test that

(3)    $\nu_j = 0$,    j = 1, ..., c.


†The above computations were given by Bartlett (1947).

This problem can be treated as a problem of regression by the introduction
of dummy fixed variates. Let

(4)    $z_{00,ij} = 1,$
       $z_{k0,ij} = 1$ if k = i,  $z_{k0,ij} = 0$ if $k \ne i$,
       $z_{0k,ij} = 1$ if k = j,  $z_{0k,ij} = 0$ if $k \ne j$.

Then (1) can be written

(5)    $\mathcal{E}Y_{ij} = \mu z_{00,ij} + \sum_{k=1}^{r}\lambda_k z_{k0,ij} + \sum_{k=1}^{c}\nu_k z_{0k,ij}.$

The hypothesis is that the coefficients of $z_{0k,ij}$ are zero. Since the matrix of
fixed variates here,

(6)    $\begin{pmatrix}
          z_{00,11} & \cdots & z_{00,rc} \\
          z_{10,11} & \cdots & z_{10,rc} \\
          \vdots & & \vdots \\
          z_{r0,11} & \cdots & z_{r0,rc} \\
          z_{01,11} & \cdots & z_{01,rc} \\
          \vdots & & \vdots \\
          z_{0c,11} & \cdots & z_{0c,rc}
          \end{pmatrix},$

is singular (for example, row 00 is the sum of rows 10, 20, ..., r0), one must
elaborate the regression theory. When one does, one finds that the test
criterion indicated by the regression theory is the usual F-test of analysis of
variance.
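Let

(7)    $\bar Y_{\cdot\cdot} = \frac{1}{rc}\sum_{i,j} Y_{ij}, \qquad \bar Y_{i\cdot} = \frac{1}{c}\sum_j Y_{ij}, \qquad \bar Y_{\cdot j} = \frac{1}{r}\sum_i Y_{ij},$

and let

(8)    $a = \sum_{i,j}(Y_{ij} - \bar Y_{i\cdot} - \bar Y_{\cdot j} + \bar Y_{\cdot\cdot})^2 = \sum_{i,j}Y_{ij}^2 - c\sum_i\bar Y_{i\cdot}^2 - r\sum_j\bar Y_{\cdot j}^2 + rc\,\bar Y_{\cdot\cdot}^2,$
       $b = r\sum_j(\bar Y_{\cdot j} - \bar Y_{\cdot\cdot})^2 = r\sum_j\bar Y_{\cdot j}^2 - rc\,\bar Y_{\cdot\cdot}^2.$

Then the F-statistic is given by

(9)    $F = \frac{b}{a} \cdot \frac{(c-1)(r-1)}{c-1}.$

Under the null hypothesis, this has the F-distribution with c - 1 and (r - 1)(c - 1)
degrees of freedom. The likelihood ratio criterion for the hypothesis
is the rc/2 power of

(10)    $\frac{a}{a+b} = \frac{1}{1 + \{(c-1)/[(r-1)(c-1)]\}F}.$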

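Now let us turn to the multivariate analysis of variance. We have a set of
p-dimensional random vectors $Y_{ij}$, i = 1, ..., r, j = 1, ..., c, with expected
values (1), where $\mu$, the $\lambda$'s, and the $\nu$'s are vectors, and with covariance
matrix $\Sigma$, and they are independently normally distributed. Then the same
algebra may be used to reduce this problem to the regression problem. We
define $\bar Y_{\cdot\cdot}$, $\bar Y_{i\cdot}$, $\bar Y_{\cdot j}$ by (7) and

(11)    $A = \sum_{i,j}(Y_{ij} - \bar Y_{i\cdot} - \bar Y_{\cdot j} + \bar Y_{\cdot\cdot})(Y_{ij} - \bar Y_{i\cdot} - \bar Y_{\cdot j} + \bar Y_{\cdot\cdot})'$
           $= \sum_{i,j}Y_{ij}Y_{ij}' - c\sum_i\bar Y_{i\cdot}\bar Y_{i\cdot}' - r\sum_j\bar Y_{\cdot j}\bar Y_{\cdot j}' + rc\,\bar Y_{\cdot\cdot}\bar Y_{\cdot\cdot}',$
        $B = r\sum_j(\bar Y_{\cdot j} - \bar Y_{\cdot\cdot})(\bar Y_{\cdot j} - \bar Y_{\cdot\cdot})'$
           $= r\sum_j\bar Y_{\cdot j}\bar Y_{\cdot j}' - rc\,\bar Y_{\cdot\cdot}\bar Y_{\cdot\cdot}'.$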
Table 8.1

                         Varieties
Location       M      S      V      T      P     Sums
UF            81    105    120    110     98      514
              81     82     80     87     84      414
W            147    142    151    192    146      778
             100    116    112    148    108      584
M             82     77     78    131     90      458
             103    105    117    140    130      595
C            120    121    124    141    125      631
              99     62     96    126     76      459
GR            99     89     69     89    104      450
              66     50     97     62     80      355
D             87     77     79    102     96      441
              68     67     67     92     94      388
Sums         616    611    621    765    659     3272
             517    482    569    655    572     2795


A statistic analogous to (10) is

(12)    $\frac{|A|}{|A+B|}.$

Under the null hypothesis, this has the distribution of U for p, n = (r - 1)(c - 1),
and $q_1 = c - 1$ given in Section 8.4. In order for A to be nonsingular
(with probability 1), we must require $p \le (r-1)(c-1)$.

As an example we use data first published by Immer, Hayes, and Powers
(1934), and later used by Fisher (1947a), by Yates and Cochran (1938), and by
Tukey (1949). The first component of the observation vector is the barley
yield in a given year; the second component is the same measurement made
the following year. Column indices run over the varieties of barley, and row
indices over the locations. The data are given in Table 8.1 [e.g., the pair 81, 81
in the upper left-hand corner indicates a yield of 81 in each year of variety M in
location UF]. The numbers along the borders are sums.

We consider the square of (147, 100) to be

        $\begin{pmatrix} 147 \\ 100 \end{pmatrix}(147\ \ 100) = \begin{pmatrix} 21{,}609 & 14{,}700 \\ 14{,}700 & 10{,}000 \end{pmatrix}.$




Then 

_ [380,944 aen) 
(13) = (315381 277,625)” 

_ {2,157,924 180346) 
(14) > (67. i )(6¥. j) pe 1,579,583 ]" 

Ј 
ai _ (1,874,386 1,560,145 
(15) LG) GE) Y (1506 Laon] 
10,750,984 9105200) 

(16) (30у. )(30у.)' = = (1075094 7.812,25] 


Then the error sum of squares is 


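\[
A = \begin{pmatrix} 3279 & 802 \\ 802 & 4017 \end{pmatrix}, \tag{17}
\]

the row sum of squares is

\[
5\sum_{i} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})' = \begin{pmatrix} 18{,}011 & 7{,}188 \\ 7{,}188 & 10{,}345 \end{pmatrix}, \tag{18}
\]

and the column sum of squares is

\[
B = \begin{pmatrix} 2788 & 2550 \\ 2550 & 2863 \end{pmatrix}. \tag{19}
\]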


The test criterion is 


\[
\frac{|A|}{|A+B|} = \frac{\begin{vmatrix} 3279 & 802 \\ 802 & 4017 \end{vmatrix}}{\begin{vmatrix} 6067 & 3352 \\ 3352 & 6880 \end{vmatrix}} = 0.4107. \tag{20}
\]

This result is to be compared with the significance point for $U_{2,4,20}$. Using the result of Section 8.4, we see that

\[
\frac{1 - \sqrt{0.4107}}{\sqrt{0.4107}} \cdot \frac{19}{4} = 2.66
\]

is to be compared with the significance point of $F_{8,38}$. This is significant at the 5% level. Our data show that there are differences between varieties.
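
The arithmetic of this example can be retraced with a short script. The following is only an illustrative sketch, not part of the original analysis: the helper names (`outer_sum`, `combo`, `det`) are hypothetical, symmetric 2 x 2 matrices are stored as triples (s11, s12, s22), and the data are typed in from Table 8.1. Under those assumptions it should reproduce the criterion (20) and the F-ratio to the accuracy quoted in the text.

```python
import math

# Table 8.1: for each location, (first-year yields, second-year yields)
# for the five varieties M, S, V, T, P.
data = {
    "UF": ([81, 105, 120, 110, 98], [81, 82, 80, 87, 84]),
    "W":  ([147, 142, 151, 192, 146], [100, 116, 112, 148, 108]),
    "M":  ([82, 77, 78, 131, 90], [103, 105, 117, 140, 130]),
    "C":  ([120, 121, 124, 141, 125], [99, 62, 96, 126, 76]),
    "GR": ([99, 89, 69, 89, 104], [66, 50, 97, 62, 80]),
    "D":  ([87, 77, 79, 102, 96], [68, 67, 67, 92, 94]),
}
r, c = len(data), 5  # r locations (rows), c varieties (columns)

# Y[i][j] is the observation 2-vector at location i, variety j.
Y = [[(y1[j], y2[j]) for j in range(c)] for (y1, y2) in data.values()]

def outer_sum(vectors):
    """Sum of v v' over 2-vectors v, stored as the triple (s11, s12, s22)."""
    return (sum(a * a for a, b in vectors),
            sum(a * b for a, b in vectors),
            sum(b * b for a, b in vectors))

def combo(*terms):
    """Linear combination of symmetric 2x2 matrices given as (weight, triple)."""
    return tuple(sum(w * t[k] for w, t in terms) for k in range(3))

def det(s):
    """Determinant of a symmetric 2x2 matrix stored as (s11, s12, s22)."""
    return s[0] * s[2] - s[1] ** 2

cells = outer_sum([y for row in Y for y in row])                     # (13)
col_tot = outer_sum([(sum(Y[i][j][0] for i in range(r)),
                      sum(Y[i][j][1] for i in range(r)))
                     for j in range(c)])                             # (14)
row_tot = outer_sum([(sum(y[0] for y in row), sum(y[1] for y in row))
                     for row in Y])                                  # (15)
grand = outer_sum([(sum(y[0] for row in Y for y in row),
                    sum(y[1] for row in Y for y in row))])           # (16)

# Error and column (variety) sums of squares and products, as in (11).
A = combo((1, cells), (-1 / c, row_tot), (-1 / r, col_tot), (1 / (r * c), grand))
B = combo((1 / r, col_tot), (-1 / (r * c), grand))

U = det(A) / det(combo((1, A), (1, B)))                              # (20)
n, m = (r - 1) * (c - 1), c - 1
F = (1 - math.sqrt(U)) / math.sqrt(U) * (n - 1) / m  # p = 2 case of Section 8.4
print(round(U, 4), round(F, 2))
```

The F-ratio line uses the exact relation for p = 2 quoted above, with 2m = 8 and 2(n - 1) = 38 degrees of freedom.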
Now let us see that each F-test in the univariate analysis of variance has analogous tests in the multivariate analysis of variance. In the linear hypothesis model for the univariate analysis of variance, one assumes that the random variables $Y_1, \dots, Y_N$ have expected values that are linear combinations of unknown parameters

\[
\mathscr{E}Y_\alpha = \sum_{g} \beta_g z_{g\alpha}, \tag{21}
\]

where the $\beta$'s are the parameters and the $z$'s are the known coefficients. The variables $\{Y_\alpha\}$ are assumed to be normally and independently distributed with common variance $\sigma^2$. In this model there is a set of linear combinations, say $\sum_{\alpha=1}^{N} \gamma_{i\alpha} Y_\alpha$, where the $\gamma$'s are known, such that

\[
a = \sum_{i=1}^{n} \Bigl( \sum_{\alpha=1}^{N} \gamma_{i\alpha} Y_\alpha \Bigr)^2 = \sum_{\alpha,\beta=1}^{N} d_{\alpha\beta} Y_\alpha Y_\beta \tag{22}
\]

is distributed as $\sigma^2\chi^2$ with $n$ degrees of freedom. There is another set of linear combinations, say $\sum_{\alpha} \phi_{g\alpha} Y_\alpha$, where the $\phi$'s are known, such that

\[
b = \sum_{g=1}^{m} \Bigl( \sum_{\alpha=1}^{N} \phi_{g\alpha} Y_\alpha \Bigr)^2 = \sum_{\alpha,\beta=1}^{N} c_{\alpha\beta} Y_\alpha Y_\beta \tag{23}
\]

is distributed as $\sigma^2\chi^2$ with $m$ degrees of freedom when the null hypothesis is true and as $\sigma^2$ times a noncentral $\chi^2$ when the null hypothesis is not true; and in either case $b$ is distributed independently of $a$. Then

\[
\frac{b/m}{a/n} = \frac{b}{a}\cdot\frac{n}{m} \tag{24}
\]

has the F-distribution with $m$ and $n$ degrees of freedom, respectively, when the null hypothesis is true. The null hypothesis is that certain $\beta$'s are zero.

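In the multivariate analysis of variance, $Y_1, \dots, Y_N$ are vector variables with $p$ components. The expected value of $Y_\alpha$ is given by (21), where $\beta_g$ is a vector of $p$ parameters. We assume that the $\{Y_\alpha\}$ are normally and independently distributed with common covariance matrix $\Sigma$. The linear combinations $\sum_\alpha \gamma_{i\alpha} Y_\alpha$ can be formed for the vectors. Then

\[
A = \sum_{i=1}^{n} \Bigl( \sum_{\alpha=1}^{N} \gamma_{i\alpha} Y_\alpha \Bigr)\Bigl( \sum_{\alpha=1}^{N} \gamma_{i\alpha} Y_\alpha \Bigr)' = \sum_{\alpha,\beta=1}^{N} d_{\alpha\beta} Y_\alpha Y_\beta' \tag{25}
\]

has the distribution $W(\Sigma, n)$. When the null hypothesis is true,

\[
B = \sum_{g=1}^{m} \Bigl( \sum_{\alpha=1}^{N} \phi_{g\alpha} Y_\alpha \Bigr)\Bigl( \sum_{\alpha=1}^{N} \phi_{g\alpha} Y_\alpha \Bigr)' = \sum_{\alpha,\beta=1}^{N} c_{\alpha\beta} Y_\alpha Y_\beta' \tag{26}
\]

has the distribution $W(\Sigma, m)$, and $B$ is independent of $A$. Then

\[
\frac{|A|}{|A+B|} = \frac{\bigl|\sum_{\alpha,\beta} d_{\alpha\beta} Y_\alpha Y_\beta'\bigr|}{\bigl|\sum_{\alpha,\beta} d_{\alpha\beta} Y_\alpha Y_\beta' + \sum_{\alpha,\beta} c_{\alpha\beta} Y_\alpha Y_\beta'\bigr|} \tag{27}
\]

has the $U_{p,m,n}$-distribution.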

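The argument for the distribution of $a$ and $b$ involves showing that $\mathscr{E}\sum_\alpha \gamma_{i\alpha} Y_\alpha = 0$ and $\mathscr{E}\sum_\alpha \phi_{g\alpha} Y_\alpha = 0$ when certain $\beta$'s are equal to zero as specified by the null hypothesis (as identities in the unspecified $\beta$'s). Clearly this argument holds for the vector case as well. Secondly, one argues, in the univariate case, that there is an orthogonal matrix $\Psi = (\psi_{\beta\gamma})$ such that when the transformation $Y_\beta = \sum_\gamma \psi_{\beta\gamma} Z_\gamma$ is made

\[
\begin{aligned}
a &= \sum_{\alpha,\beta,\gamma,\delta} d_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} Z_\gamma Z_\delta = \sum_{\alpha=1}^{n} Z_\alpha^2, \\
b &= \sum_{\alpha,\beta,\gamma,\delta} c_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} Z_\gamma Z_\delta = \sum_{\alpha=n+1}^{n+m} Z_\alpha^2.
\end{aligned} \tag{28}
\]

Because the transformation is orthogonal, the $\{Z_\alpha\}$ are independently and normally distributed with common variance $\sigma^2$. Since the $Z_\alpha$, $\alpha = 1, \dots, n$, must be linear combinations of $\sum_\alpha \gamma_{i\alpha} Y_\alpha$ and since the $Z_\alpha$, $\alpha = n+1, \dots, n+m$, must be linear combinations of $\sum_\alpha \phi_{g\alpha} Y_\alpha$, they must have means zero (under the null hypothesis). Thus $a/\sigma^2$ and $b/\sigma^2$ have the stated independent $\chi^2$-distributions.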

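In the multivariate case the transformation $Y_\beta = \sum_\gamma \psi_{\beta\gamma} Z_\gamma$ is used, where $Y_\beta$ and $Z_\gamma$ are vectors. Then

\[
\begin{aligned}
A &= \sum_{\alpha,\beta,\gamma,\delta} d_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} Z_\gamma Z_\delta' = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha', \\
B &= \sum_{\alpha,\beta,\gamma,\delta} c_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} Z_\gamma Z_\delta' = \sum_{\alpha=n+1}^{n+m} Z_\alpha Z_\alpha',
\end{aligned} \tag{29}
\]

because it follows from (28) that $\sum_{\alpha,\beta} d_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} = 1$, $\gamma = \delta \le n$, and $= 0$ otherwise, and $\sum_{\alpha,\beta} c_{\alpha\beta}\psi_{\alpha\gamma}\psi_{\beta\delta} = 1$, $n+1 \le \gamma = \delta \le n+m$, and $= 0$ otherwise. Since $\Psi$ is orthogonal, the $\{Z_\alpha\}$ are independently normally distributed with covariance matrix $\Sigma$. The same argument shows $\mathscr{E}Z_\alpha = 0$, $\alpha = 1, \dots, n+m$, under the null hypothesis. Thus $A$ and $B$ are independently distributed according to $W(\Sigma, n)$ and $W(\Sigma, m)$, respectively.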


8.10. SOME OPTIMAL PROPERTIES OF TESTS 


8.10.1. Admissibility of Invariant Tests 


In this chapter we have considered several tests of a linear hypothesis which 
are invariant with respect to transformations that leave the null hypothesis 
invariant. We raise the question of which invariant tests are good tests. In 
particular we ask for admissible procedures, that is, procedures that cannot 
be improved on in the sense of smaller probabilities of Type I and/or Type 
II error. The competing tests are not necessarily invariant. Clearly, if an 
invariant test is admissible in the class of all tests, it is admissible in the class 
of invariant tests. 

Testing the general linear hypothesis as treated here is a generalization of testing the hypothesis concerning one mean vector as treated in Chapter 5. The invariant procedures in Chapter 8 are generalizations of the $T^2$-test. One way of showing a procedure is admissible is to display a prior distribution on the parameters such that the Bayes procedure is a given test procedure. This approach requires some ingenuity in constructing the prior, but the verification of the property given the prior is straightforward. Problems 8.26 and 8.27 show that the Bartlett-Nanda-Pillai trace criterion $V$ and Wilks's likelihood ratio criterion $U$ yield admissible tests. The disadvantage of this approach to admissibility is that one must invent a prior distribution for each procedure; a general theorem does not cover many cases.

The other approach to admissibility is to apply Stein's theorem (Theorem 
5.6.5), which yields general results. The invariant tests can be stated in terms 
of the roots of the determinantal equation 


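\[
|H - \lambda(H + G)| = 0, \tag{1}
\]

where $H = \hat{\beta}_1 A_{11\cdot 2}\hat{\beta}_1' = W_1W_1'$ and $G = N\hat{\Sigma}_\Omega = W_3W_3'$. There is also a matrix $\hat{\beta}_2$ (or $W_2$) associated with the nuisance parameters $\beta_2$. For convenience, we define the canonical form in the following notation. Let $W_1 = X$ ($p \times m$), $W_2 = Y$ ($p \times r$), $W_3 = Z$ ($p \times n$), $\mathscr{E}X = \Xi$, $\mathscr{E}Y = \mathrm{H}$, and $\mathscr{E}Z = 0$; the columns are independently normally distributed with covariance matrix $\Sigma$. The null hypothesis is $\Xi = 0$, and the alternative hypothesis is $\Xi \ne 0$.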


The usual tests are given in terms of the (nonzero) roots of 


\[
|XX' - \lambda(ZZ' + XX')| = |XX' - \lambda(U - YY')| = 0, \tag{2}
\]

where $U = XX' + YY' + ZZ'$. Except for roots that are identically zero, the roots of (2) coincide with the nonzero characteristic roots of $X'(U - YY')^{-1}X$. Let $V = (X, Y, U)$ and

\[
M(V) = X'(U - YY')^{-1}X. \tag{3}
\]

The vector of ordered characteristic roots of $M(V)$ is denoted by

\[
(\lambda_1, \dots, \lambda_m)' = \lambda(M(V)), \tag{4}
\]

where $\lambda_1 \ge \cdots \ge \lambda_m \ge 0$. Since the inclusion of zero roots (when $m > p$) causes no trouble in the sequel, we assume that the tests depend on $\lambda(M(V))$.

The admissibility of these tests can be stated in terms of the geometric characteristics of the acceptance regions. Let

\[
\begin{aligned}
R^m_{\ge} &= \{\lambda \in R^m \mid \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge 0\}, \\
R^m_{+} &= \{\lambda \in R^m \mid \lambda_1 \ge 0, \dots, \lambda_m \ge 0\}.
\end{aligned} \tag{5}
\]


It seems reasonable that if a set of sample roots leads to acceptance of the 
null hypothesis, then a set of smaller roots would as well (Figure 8.2). 


Definition 8.10.1. A region $A \subset R^m_{\ge}$ is monotone if $\lambda \in A$, $\nu \in R^m_{\ge}$, and $\nu_i \le \lambda_i$, $i = 1, \dots, m$, imply $\nu \in A$.


Definition 8.10.2. For $A \subset R^m_{\ge}$ the extended region $A^*$ is

\[
A^* = \bigcup_{\pi} \{(x_{\pi(1)}, \dots, x_{\pi(m)})' \mid x \in A\}, \tag{6}
\]

where $\pi$ ranges over all permutations of $(1, \dots, m)$.


Figure 8.2. A monotone acceptance region.
The main result, first proved by Schwartz (1967), is the following theorem:


Theorem 8.10.1. If the region $A \subset R^m_{\ge}$ is monotone and if the extended region $A^*$ is closed and convex, then $A$ is the acceptance region of an admissible test.


Another characterization of admissible tests is given in terms of majoriza- 
tion. 


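Definition 8.10.3. A vector $\lambda = (\lambda_1, \dots, \lambda_m)'$ weakly majorizes a vector $\nu = (\nu_1, \dots, \nu_m)'$ if

\[
\lambda_{[1]} \ge \nu_{[1]}, \quad \lambda_{[1]} + \lambda_{[2]} \ge \nu_{[1]} + \nu_{[2]}, \quad \dots, \quad \lambda_{[1]} + \cdots + \lambda_{[m]} \ge \nu_{[1]} + \cdots + \nu_{[m]}, \tag{7}
\]

where $\lambda_{[i]}$ and $\nu_{[i]}$, $i = 1, \dots, m$, are the coordinates rearranged in nonascending order.


We use the notation $\lambda \succ_w \nu$ or $\nu \prec_w \lambda$ if $\lambda$ weakly majorizes $\nu$. If $\lambda, \nu \in R^m_{\ge}$, then $\lambda \succ_w \nu$ is simply

\[
\lambda_1 \ge \nu_1, \quad \lambda_1 + \lambda_2 \ge \nu_1 + \nu_2, \quad \dots, \quad \lambda_1 + \cdots + \lambda_m \ge \nu_1 + \cdots + \nu_m. \tag{8}
\]

If the last inequality in (7) is replaced by an equality, we say simply that $\lambda$ majorizes $\nu$ and denote this by $\lambda \succ \nu$ or $\nu \prec \lambda$. The theory of majorization and the related inequalities are developed in detail in Marshall and Olkin (1979).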
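
The partial-sum conditions (7) are easy to check mechanically. The sketch below is an illustration only; the function names `weakly_majorizes` and `majorizes` and the tolerance argument are assumptions introduced here, not notation from the text.

```python
def weakly_majorizes(lam, nu, tol=1e-12):
    """Return True if lam weakly majorizes nu in the sense of (7)."""
    if len(lam) != len(nu):
        raise ValueError("vectors must have the same length")
    # Rearrange both vectors in nonascending order.
    a = sorted(lam, reverse=True)
    b = sorted(nu, reverse=True)
    sum_a = sum_b = 0.0
    for x, y in zip(a, b):
        sum_a += x
        sum_b += y
        # Every partial sum of lam must dominate that of nu.
        if sum_a < sum_b - tol:
            return False
    return True


def majorizes(lam, nu, tol=1e-12):
    """Weak majorization with equality in the last inequality of (7)."""
    return weakly_majorizes(lam, nu, tol) and abs(sum(lam) - sum(nu)) <= tol


print(weakly_majorizes([3, 1], [2, 2]))  # partial sums 3 >= 2 and 4 >= 4
print(weakly_majorizes([2, 2], [3, 1]))  # fails at the first partial sum
```

For example, (3, 1) majorizes (2, 2), while (3, 1) only weakly majorizes (1, 0.5) because the total sums differ.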


Definition 8.10.4. A region $A \subset R^m_{\ge}$ is monotone in majorization if $\lambda \in A$, $\nu \in R^m_{\ge}$, $\nu \prec_w \lambda$ imply $\nu \in A$. (See Figure 8.3.)


Theorem 8.10.2. If a region $A \subset R^m_{\ge}$ is closed, convex, and monotone in majorization, then $A$ is the acceptance region of an admissible test.


Figure 8.3. A region monotone in majorization.


Theorems 8.10.1 and 8.10.2 are equivalent; it will be convenient to prove Theorem 8.10.2 first. Then an argument about the extreme points of a certain convex set (Lemma 8.10.11) establishes the equivalence of the two theorems.

Theorem 5.6.5 (Stein's theorem) will be used because we can write the distribution of $(X, Y, Z)$ in exponential form. Let $U = XX' + YY' + ZZ' = (u_{ij})$ and $\Sigma^{-1} = (\sigma^{ij})$. For a general matrix $C = (c_1, \dots, c_k)$, let $\mathrm{vec}(C) = (c_1', \dots, c_k')'$. The density of $(X, Y, Z)$ can be written as

\[
\begin{aligned}
f(X, Y, Z) &= K(\Xi, \mathrm{H}, \Sigma) \exp\{\mathrm{tr}\,\Xi'\Sigma^{-1}X + \mathrm{tr}\,\mathrm{H}'\Sigma^{-1}Y - \tfrac{1}{2}\mathrm{tr}\,\Sigma^{-1}U\} \\
&= K(\Xi, \mathrm{H}, \Sigma) \exp\{\omega_{(1)}'y_{(1)} + \omega_{(2)}'y_{(2)} + \omega_{(3)}'y_{(3)}\},
\end{aligned} \tag{9}
\]

where $K(\Xi, \mathrm{H}, \Sigma)$ is a constant,

\[
\begin{aligned}
\omega_{(1)} &= \mathrm{vec}(\Sigma^{-1}\Xi), \qquad \omega_{(2)} = \mathrm{vec}(\Sigma^{-1}\mathrm{H}), \\
\omega_{(3)} &= -\tfrac{1}{2}(\sigma^{11}, 2\sigma^{12}, \dots, 2\sigma^{1p}, \sigma^{22}, \dots, \sigma^{pp})', \\
y_{(1)} &= \mathrm{vec}(X), \qquad y_{(2)} = \mathrm{vec}(Y), \\
y_{(3)} &= (u_{11}, u_{12}, \dots, u_{1p}, u_{22}, \dots, u_{pp})'.
\end{aligned} \tag{10}
\]

If we denote the mapping $(X, Y, Z) \to y = (y_{(1)}', y_{(2)}', y_{(3)}')'$ by $g$, $y = g(X, Y, Z)$, then the measure of a set $A$ in the space of $y$ is $m(A) = \mu(g^{-1}(A))$, where $\mu$ is the ordinary Lebesgue measure on $R^{p(m+r+n)}$. We note that $(X, Y, U)$ is a sufficient statistic and so is $y = (y_{(1)}', y_{(2)}', y_{(3)}')'$. Because a test that is admissible with respect to the class of tests based on a sufficient statistic is admissible in the whole class of tests, we consider only tests based on a sufficient statistic. Then the acceptance regions of these tests are subsets in the space of $y$. The density of $y$ given by the right-hand side of (9) is of the form of the exponential family, and therefore we can apply Stein's theorem. Furthermore, since the transformation $(X, Y, U) \to y$ is linear, we prove the convexity of an acceptance region of $(X, Y, U)$. The acceptance region of an invariant test is given in terms of $\lambda(M(V)) = (\lambda_1, \dots, \lambda_m)'$. Therefore, in order to prove the admissibility of these tests we have to check that the inverse image of $A$, namely, $\tilde{A} = \{V \mid \lambda(M(V)) \in A\}$, satisfies the conditions of Stein's theorem, namely, is convex.

Suppose $V_i = (X_i, Y_i, U_i) \in \tilde{A}$, $i = 1, 2$, that is, $\lambda[M(V_i)] \in A$. By the convexity of $A$, $p\lambda[M(V_1)] + q\lambda[M(V_2)] \in A$ for $0 \le p = 1 - q \le 1$. To show $pV_1 + qV_2 \in \tilde{A}$, that is, $\lambda[M(pV_1 + qV_2)] \in A$, we use the property of monotonicity of majorization of $A$ and the following theorem.

Figure 8.4. Theorem 8.10.3. [Points shown: $\lambda^{(1)} = \lambda(M(V_1))$, $\lambda^{(2)} = \lambda(M(V_2))$, $p\lambda^{(1)} + q\lambda^{(2)}$, and $\lambda(M(pV_1 + qV_2))$.]

Theorem 8.10.3.

\[
\lambda[M(pV_1 + qV_2)] \prec_w p\lambda[M(V_1)] + q\lambda[M(V_2)]. \tag{11}
\]

The proof of Theorem 8.10.3 (Figure 8.4) follows from the pair of majorizations

\[
\lambda[M(pV_1 + qV_2)] \prec_w \lambda[pM(V_1) + qM(V_2)] \prec_w p\lambda[M(V_1)] + q\lambda[M(V_2)]. \tag{12}
\]

The second majorization in (12) is a special case of the following lemma.

Lemma 8.10.1. For $A$ and $B$ symmetric,

\[
\lambda(A + B) \prec_w \lambda(A) + \lambda(B). \tag{13}
\]

Proof. By Corollary A.4.2 of the Appendix,

\[
\begin{aligned}
\sum_{i=1}^{k} \lambda_i(A + B) &= \max_{R'R = I_k} \mathrm{tr}\,R'(A + B)R \\
&\le \max_{R'R = I_k} \mathrm{tr}\,R'AR + \max_{R'R = I_k} \mathrm{tr}\,R'BR \\
&= \sum_{i=1}^{k} \lambda_i(A) + \sum_{i=1}^{k} \lambda_i(B), \qquad k = 1, \dots, p. \qquad ∎
\end{aligned} \tag{14}
\]
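
Lemma 8.10.1 can be spot-checked numerically for 2 x 2 symmetric matrices, whose characteristic roots have a closed form. The following is a hedged sketch on one arbitrarily chosen pair of matrices, not a proof; the helpers `eig_sym2` and `is_weakly_majorized` are hypothetical names introduced for the illustration.

```python
import math


def eig_sym2(a, b, d):
    """Characteristic roots, largest first, of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return [mean + radius, mean - radius]


def is_weakly_majorized(nu, lam, tol=1e-9):
    """True if nu is weakly majorized by lam (partial sums of nu never exceed those of lam)."""
    nu_sorted = sorted(nu, reverse=True)
    lam_sorted = sorted(lam, reverse=True)
    s_nu = s_lam = 0.0
    for x, y in zip(nu_sorted, lam_sorted):
        s_nu += x
        s_lam += y
        if s_nu > s_lam + tol:
            return False
    return True


# Two symmetric matrices stored as (a, b, d); the choice is arbitrary.
A = (2.0, 1.0, 0.0)     # [[2, 1], [1, 0]]
B = (0.0, -1.0, 3.0)    # [[0, -1], [-1, 3]]

lam_sum = eig_sym2(*(x + y for x, y in zip(A, B)))   # roots of A + B
lam_a, lam_b = eig_sym2(*A), eig_sym2(*B)
bound = [x + y for x, y in zip(lam_a, lam_b)]        # lambda(A) + lambda(B)

print(is_weakly_majorized(lam_sum, bound))
```

Here the last partial sums agree (both equal the trace of A + B), so the tolerance guards against rounding in the comparison.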
Let $A > B$ mean $A - B$ is positive definite and $A \ge B$ mean $A - B$ is positive semidefinite.

The first majorization in (12) follows from several lemmas.


Lemma 8.10.2.

\[
pU_1 + qU_2 - (pY_1 + qY_2)(pY_1 + qY_2)' \ge p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2'). \tag{15}
\]

Proof. The left-hand side minus the right-hand side is

\[
\begin{aligned}
pY_1Y_1' &+ qY_2Y_2' - p^2Y_1Y_1' - q^2Y_2Y_2' - pq(Y_1Y_2' + Y_2Y_1') \\
&= p(1 - p)Y_1Y_1' + q(1 - q)Y_2Y_2' - pq(Y_1Y_2' + Y_2Y_1') \\
&= pq(Y_1 - Y_2)(Y_1 - Y_2)' \ge 0. \qquad ∎
\end{aligned} \tag{16}
\]

Lemma 8.10.3. If $A \ge B > 0$, then $A^{-1} \le B^{-1}$.

Proof. See Problem 8.31. ∎

Lemma 8.10.4. If $A > 0$, then $f(x, A) = x'A^{-1}x$ is convex in $(x, A)$.

Proof. See Problem 5.17. ∎

Lemma 8.10.5. If $A_1 > 0$, $A_2 > 0$, then

\[
(pB_1 + qB_2)'(pA_1 + qA_2)^{-1}(pB_1 + qB_2) \le pB_1'A_1^{-1}B_1 + qB_2'A_2^{-1}B_2. \tag{17}
\]

Proof. From Lemma 8.10.4 we have for all $y$

\[
\begin{aligned}
py'B_1'A_1^{-1}B_1y &+ qy'B_2'A_2^{-1}B_2y - y'(pB_1 + qB_2)'(pA_1 + qA_2)^{-1}(pB_1 + qB_2)y \\
&= p(B_1y)'A_1^{-1}(B_1y) + q(B_2y)'A_2^{-1}(B_2y) \\
&\quad - (pB_1y + qB_2y)'(pA_1 + qA_2)^{-1}(pB_1y + qB_2y) \\
&\ge 0.
\end{aligned} \tag{18}
\]

Thus the matrix of the quadratic form in $y$ is positive semidefinite. ∎


The relation as in (17) is sometimes called matrix convexity. [See Marshall and Olkin (1979).]
Lemma 8.10.6.

\[
M(pV_1 + qV_2) \le pM(V_1) + qM(V_2), \tag{19}
\]

where $V_1 = (X_1, Y_1, U_1)$, $V_2 = (X_2, Y_2, U_2)$, $U_1 - Y_1Y_1' > 0$, $U_2 - Y_2Y_2' > 0$, $0 \le p = 1 - q \le 1$.


Proof. Lemmas 8.10.2 and 8.10.3 show that

\[
[pU_1 + qU_2 - (pY_1 + qY_2)(pY_1 + qY_2)']^{-1} \le [p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2')]^{-1}. \tag{20}
\]

This implies

\[
M(pV_1 + qV_2) \le (pX_1 + qX_2)'[p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2')]^{-1}(pX_1 + qX_2). \tag{21}
\]

Then Lemma 8.10.5 implies that the right-hand side of (21) is less than or equal to

\[
pX_1'(U_1 - Y_1Y_1')^{-1}X_1 + qX_2'(U_2 - Y_2Y_2')^{-1}X_2 = pM(V_1) + qM(V_2). \qquad ∎ \tag{22}
\]

Lemma 8.10.7. If $A \le B$, then $\lambda(A) \prec_w \lambda(B)$.

Proof. From Corollary A.4.2 of the Appendix,

\[
\sum_{i=1}^{k} \lambda_i(A) = \max_{R'R = I_k} \mathrm{tr}\,R'AR \le \max_{R'R = I_k} \mathrm{tr}\,R'BR = \sum_{i=1}^{k} \lambda_i(B), \qquad k = 1, \dots, p. \qquad ∎ \tag{23}
\]

From Lemma 8.10.7 we obtain the first majorization in (12) and hence Theorem 8.10.3, which in turn implies the convexity of $\tilde{A}$. Thus the acceptance region satisfies condition (i) of Stein's theorem.


Lemma 8.10.8. For the acceptance region A of Theorem 8.10.1 or Theorem 
8.10.2, condition (ii) of Stein’s theorem is satisfied. 


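Proof. Let $\omega$ correspond to $(\Phi, \Psi, \Theta)$; then

\[
\begin{aligned}
\omega'y &= \omega_{(1)}'y_{(1)} + \omega_{(2)}'y_{(2)} + \omega_{(3)}'y_{(3)} \\
&= \mathrm{tr}\,\Phi'X + \mathrm{tr}\,\Psi'Y - \tfrac{1}{2}\mathrm{tr}\,\Theta U,
\end{aligned} \tag{24}
\]

where $\Theta$ is symmetric. Suppose that $\{y \mid \omega'y > c\}$ is disjoint from $\tilde{A} = \{V \mid \lambda(M(V)) \in A\}$. We want to show that in this case $\Theta$ is positive semidefinite. If this were not true, then

\[
\Theta = D\begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix}D', \tag{25}
\]

where $D$ is nonsingular and $-I$ is not vacuous. Let $X = (1/\gamma)X_0$, $Y = (1/\gamma)Y_0$,

\[
U = (D')^{-1}\begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix}D^{-1}, \tag{26}
\]

and $V = (X, Y, U)$, where $X_0$, $Y_0$ are fixed matrices and $\gamma$ is a positive number. Then

\[
\omega'y = \frac{1}{\gamma}\mathrm{tr}\,\Phi'X_0 + \frac{1}{\gamma}\mathrm{tr}\,\Psi'Y_0 + \frac{1}{2}\mathrm{tr}\begin{pmatrix} -I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & 0 \end{pmatrix} > c \tag{27}
\]

for sufficiently large $\gamma$. On the other hand,

\[
\begin{aligned}
\lambda(M(V)) &= \lambda\{X'(U - YY')^{-1}X\} \\
&= \frac{1}{\gamma^2}\lambda\Bigl\{X_0'\Bigl[(D')^{-1}\begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix}D^{-1} - \frac{1}{\gamma^2}Y_0Y_0'\Bigr]^{-1}X_0\Bigr\} \to 0
\end{aligned} \tag{28}
\]

as $\gamma \to \infty$. Therefore, $V \in \tilde{A}$ for sufficiently large $\gamma$. This is a contradiction. Hence $\Theta$ is positive semidefinite.

Now let $\omega_1$ correspond to $(\Phi_1, 0, I)$, where $\Phi_1 \ne 0$. Then $I + \lambda\Theta$ is positive definite and $\Phi_1 + \lambda\Phi \ne 0$ for sufficiently large $\lambda$. Hence $\omega_1 + \lambda\omega \in \Omega - \Omega_0$ for sufficiently large $\lambda$. ∎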


The preceding proof was suggested by Charles Stein. 
By Theorem 5.6.5, Theorem 8.10.3 and Lemma 8.10.8 now imply Theorem 
8.10.2. 


To obtain Theorem 8.10.1 from Theorem 8.10.2, we use the following 
lemmas. 


Lemma 8.10.9. $A \subset R^m_{\ge}$ is convex and monotone in majorization if and only if $A$ is monotone and $A^*$ is convex.

Figure 8.5. [Region $D(\lambda)$ for $\lambda = (\lambda_1, \lambda_2)$, with extreme points marked, including $(\lambda_1, 0)$, $(0, \lambda_1)$, and $(\lambda_2, \lambda_1)$.]


Proof. Necessity. If $A$ is monotone in majorization, then it is obviously monotone. $A^*$ is convex (see Problem 8.35).


Sufficiency. For $\lambda \in R^m_{\ge}$ let

\[
\begin{aligned}
C(\lambda) &= \{x \mid x \in R^m_{+},\; x \prec_w \lambda\}, \\
D(\lambda) &= \{x \mid x \in R^m_{\ge},\; x \prec_w \lambda\}.
\end{aligned} \tag{29}
\]

It will be proved in Lemma 8.10.10, Lemma 8.10.11, and its corollary that monotonicity of $A$ and convexity of $A^*$ imply $C(\lambda) \subset A^*$. Then $D(\lambda) = C(\lambda) \cap R^m_{\ge} \subset A^* \cap R^m_{\ge} = A$. Now suppose $\nu \in R^m_{\ge}$ and $\nu \prec_w \lambda$. Then $\nu \in D(\lambda) \subset A$. This shows that $A$ is monotone in majorization. Furthermore, if $A^*$ is convex, then $A = R^m_{\ge} \cap A^*$ is convex. (See Figure 8.5.) ∎


Lemma 8.10.10. Let $C$ be compact and convex, and let $D$ be convex. If the extreme points of $C$ are contained in $D$, then $C \subset D$.


Proof. Obvious. ∎


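Lemma 8.10.11. Every extreme point of $C(\lambda)$ is of the form

\[
(\delta_{\pi(1)}\lambda_{\pi(1)}, \dots, \delta_{\pi(m)}\lambda_{\pi(m)})', \tag{30}
\]

where $\pi$ is a permutation of $(1, \dots, m)$ and $\delta_1 = \cdots = \delta_k = 1$, $\delta_{k+1} = \cdots = \delta_m = 0$ for some $k$.


Proof. $C(\lambda)$ is convex. (See Problem 8.34.) Now note that $C(\lambda)$ is permutation-symmetric, that is, if $(x_1, \dots, x_m)' \in C(\lambda)$, then $(x_{\pi(1)}, \dots, x_{\pi(m)})' \in C(\lambda)$ for any permutation $\pi$. Therefore, for any permutation $\pi$, $\pi(C(\lambda)) = \{(x_{\pi(1)}, \dots, x_{\pi(m)})' \mid x \in C(\lambda)\}$ coincides with $C(\lambda)$. This implies that if $(x_1, \dots, x_m)'$ is an extreme point of $C(\lambda)$, then $(x_{\pi(1)}, \dots, x_{\pi(m)})'$ is also an extreme point. In particular, $(x_{[1]}, \dots, x_{[m]})' \in R^m_{\ge}$ is an extreme point. Conversely, if $(x_1, \dots, x_m)' \in R^m_{\ge}$ is an extreme point of $C(\lambda)$, then $(x_{\pi(1)}, \dots, x_{\pi(m)})'$ is an extreme point.

We see that once we enumerate the extreme points of $C(\lambda)$ in $R^m_{\ge}$, the rest of the extreme points can be obtained by permutation.

Suppose $x \in R^m_{\ge}$. An extreme point, being the intersection of $m$ hyperplanes, has to satisfy $m$ or more of the following $2m$ equations:

\[
\begin{aligned}
E_1&: x_1 = 0, & F_1&: x_1 = \lambda_1, \\
E_2&: x_2 = 0, & F_2&: x_1 + x_2 = \lambda_1 + \lambda_2, \\
&\;\;\vdots & &\;\;\vdots \\
E_m&: x_m = 0, & F_m&: x_1 + \cdots + x_m = \lambda_1 + \cdots + \lambda_m.
\end{aligned} \tag{31}
\]

Suppose that $k$ is the first index such that $E_k$ holds. Then $x \in R^m_{\ge}$ implies $0 = x_k \ge x_{k+1} \ge \cdots \ge x_m \ge 0$. Therefore, $E_k, \dots, E_m$ hold. The remaining $k - 1 = m - (m - k + 1)$ or more equations are among the $F$'s. We order them as $F_{i_1}, \dots, F_{i_l}$, where $i_1 < \cdots < i_l$, $l \ge k - 1$. Now $i_1 < \cdots < i_l$ implies $i_l \ge l$ with equality if and only if $i_1 = 1, \dots, i_l = l$. In this case $F_1, \dots, F_{k-1}$ hold ($l \ge k - 1$). Now suppose $i_l > l$. Since $x_k = \cdots = x_m = 0$,

\[
F_{i_l}: x_1 + \cdots + x_{k-1} = \lambda_1 + \cdots + \lambda_{k-1} + \lambda_k + \cdots + \lambda_{i_l}. \tag{32}
\]

But $x_1 + \cdots + x_{k-1} \le \lambda_1 + \cdots + \lambda_{k-1}$, and we have $\lambda_k + \cdots + \lambda_{i_l} = 0$. Therefore, $0 = \lambda_k + \cdots + \lambda_{i_l} \ge \lambda_k \ge \cdots \ge \lambda_m \ge 0$. In this case $F_{k-1}, \dots, F_m$ reduce to the same equation $x_1 + \cdots + x_{k-1} = \lambda_1 + \cdots + \lambda_{k-1}$. It follows that $x$ satisfies $k - 2$ more equations, which have to be $F_1, \dots, F_{k-2}$. We have shown that in either case $E_k, \dots, E_m, F_1, \dots, F_{k-1}$ hold and this gives the point $\beta = (\lambda_1, \dots, \lambda_{k-1}, 0, \dots, 0)'$, which is in $R^m_{\ge} \cap C(\lambda)$. Therefore, $\beta$ is an extreme point. ∎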
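
To make the form (30) concrete, the candidate extreme points can be enumerated for a small example. The sketch below is illustrative only; `candidate_extreme_points` and `in_C` are hypothetical helper names, and the code merely lists the points of the form (30) and checks that each lies in $C(\lambda)$, without verifying extremality.

```python
from itertools import permutations


def candidate_extreme_points(lam):
    """All permutations of (lam_1, ..., lam_k, 0, ..., 0), k = 0, ..., m, as in (30)."""
    lam = sorted(lam, reverse=True)
    m = len(lam)
    points = set()
    for k in range(m + 1):
        truncated = lam[:k] + [0] * (m - k)   # keep the k largest, zero the rest
        points.update(permutations(truncated))
    return points


def in_C(x, lam, tol=1e-12):
    """Membership in C(lam): x has nonnegative coordinates and is weakly majorized by lam."""
    if any(v < -tol for v in x):
        return False
    xs = sorted(x, reverse=True)
    ls = sorted(lam, reverse=True)
    sx = sl = 0
    for a, b in zip(xs, ls):
        sx += a
        sl += b
        if sx > sl + tol:
            return False
    return True


pts = candidate_extreme_points([3, 1])
print(sorted(pts))
print(all(in_C(p, [3, 1]) for p in pts))
```

For $\lambda = (3, 1)'$ this gives five points, matching the five vertices of the polygon $\{x \ge 0,\ \max(x_1, x_2) \le 3,\ x_1 + x_2 \le 4\}$.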


Corollary 8.10.1. $C(\lambda) \subset A^*$.


Proof. If $A$ is monotone, then $A^*$ is monotone in the sense that if $\lambda = (\lambda_1, \dots, \lambda_m)' \in A^*$, $\nu = (\nu_1, \dots, \nu_m)'$, $\nu_i \le \lambda_i$, $i = 1, \dots, m$, then $\nu \in A^*$. (See Problem 8.35.) Now the extreme points of $C(\lambda)$ given by (30) are in $A^*$ because of permutation symmetry and monotonicity of $A^*$. Hence, by Lemma 8.10.10, $C(\lambda) \subset A^*$. ∎

Proof of Theorem 8.10.1. Immediate from Theorem 8.10.2 and Lemma 8.10.9. ∎


Application of the theory of Schur-convex functions yields several corollaries to Theorem 8.10.2.


Corollary 8.10.2. Let $g$ be continuous, nondecreasing, and convex in $[0, 1)$. Let

\[
f(\lambda) = f(\lambda_1, \dots, \lambda_m) = \sum_{i=1}^{m} g(\lambda_i). \tag{33}
\]

Then a test with the acceptance region $A = \{\lambda \mid f(\lambda) \le c\}$ is admissible.


Proof. Being a sum of convex functions $f$ is convex, and hence $A$ is convex. $A$ is closed because $f$ is continuous. We want to show that if $f(x) \le c$ and $y \prec_w x$ ($x, y \in R^m_{\ge}$), then $f(y) \le c$. Let $\tilde{x}_k = \sum_{i=1}^{k} x_i$, $\tilde{y}_k = \sum_{i=1}^{k} y_i$. Then $y \prec_w x$ if and only if $\tilde{x}_k \ge \tilde{y}_k$, $k = 1, \dots, m$. Let $f(x) = h(\tilde{x}_1, \dots, \tilde{x}_m) = g(\tilde{x}_1) + \sum_{i=2}^{m} g(\tilde{x}_i - \tilde{x}_{i-1})$. It suffices to show that $h(\tilde{x}_1, \dots, \tilde{x}_m)$ is increasing in each $\tilde{x}_i$. For $i \le m - 1$ the convexity of $g$ implies that

\[
\begin{aligned}
h(\tilde{x}_1, \dots, \tilde{x}_i + \varepsilon, \dots, \tilde{x}_m) &- h(\tilde{x}_1, \dots, \tilde{x}_m) \\
&= g(x_i + \varepsilon) - g(x_i) - \{g(x_{i+1}) - g(x_{i+1} - \varepsilon)\} \ge 0.
\end{aligned} \tag{34}
\]

For $i = m$ the monotonicity of $g$ implies

\[
h(\tilde{x}_1, \dots, \tilde{x}_m + \varepsilon) - h(\tilde{x}_1, \dots, \tilde{x}_m) = g(x_m + \varepsilon) - g(x_m) \ge 0. \qquad ∎ \tag{35}
\]

Setting $g(\lambda) = -\log(1 - \lambda)$, $g(\lambda) = \lambda/(1 - \lambda)$, $g(\lambda) = \lambda$, respectively, shows that Wilks' likelihood ratio test, the Lawley-Hotelling trace test, and the Bartlett-Nanda-Pillai test are admissible. Admissibility of Roy's maximum root test $A: \lambda_1 \le c$ follows directly from Theorem 8.10.1 or Theorem 8.10.2. On the other hand, the minimum root test, $\lambda_t \le c$, where $t = \min(m, p)$, does not satisfy the convexity condition. The following theorem shows that this test is actually inadmissible.
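
The choices of $g$ above can be seen side by side by evaluating each criterion on a given set of roots. This is a small sketch under the assumption that the roots $\lambda_i$ of (1) are already available and lie in $[0, 1)$; the function name `invariant_criteria` and the dictionary keys are hypothetical labels, not notation from the text.

```python
import math


def invariant_criteria(roots):
    """Evaluate the four criteria as functions of the roots lambda_i of (1).

    Each root must lie in [0, 1). The first three entries are sums of g(lambda_i)
    for the three choices of g named in the text.
    """
    if any(not (0 <= x < 1) for x in roots):
        raise ValueError("roots must lie in [0, 1)")
    return {
        # g(x) = -log(1 - x); the sum equals -log U for Wilks's U = prod(1 - lambda_i)
        "wilks_neg_log_U": sum(-math.log(1 - x) for x in roots),
        # g(x) = x / (1 - x): Lawley-Hotelling trace
        "lawley_hotelling": sum(x / (1 - x) for x in roots),
        # g(x) = x: Bartlett-Nanda-Pillai trace
        "pillai": sum(roots),
        # Roy's maximum root (not of the additive form, but monotone and convex)
        "roy": max(roots),
    }


print(invariant_criteria([0.5, 0.2]))
```

Each acceptance region is then of the form "criterion value at most c"; for the first entry this is the same as $U \ge e^{-c}$.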


Theorem 8.10.4. A necessary condition for an invariant test to be admissible is that the extended region in the space of $\sqrt{\lambda_1}, \dots, \sqrt{\lambda_t}$ is convex and monotone.


We shall only sketch the proof of this theorem [following Schwartz (1967)]. Let $\sqrt{\lambda_i} = d_i$, $i = 1, \dots, t$, and let the density of $d_1, \dots, d_t$ be $f(d \mid \nu)$, where $\nu = (\nu_1, \dots, \nu_t)'$ is defined in Section 8.6.5 and $f(d \mid \nu)$ is given in Chapter 13. The ratio $f(d \mid \nu)/f(d \mid 0)$ can be extended symmetrically to the unit cube ($0 \le d_i \le 1$, $i = 1, \dots, t$). The extended ratio is then a convex function and is strictly increasing in each $d_i$. A proper Bayes procedure has an acceptance region

\[
\int \frac{f(d \mid \nu)}{f(d \mid 0)}\, d\Pi(\nu) \le c, \tag{36}
\]

where $\Pi(\nu)$ is a finite measure on the space of $\nu$'s. Then the symmetric extension of the set of $d$ satisfying (36) is convex and monotone [as shown by Birnbaum (1955)]. The closure (in the weak* topology) of the set of Bayes procedures forms an essentially complete class [Wald (1950)]. In this case the limit of the convex monotone acceptance regions is convex and monotone. The exposition of admissibility here was developed by Anderson and Takemura (1982).


8.10.2. Unbiasedness of Tests and Monotonicity of Power Functions 


A test T is called unbiased if the power achieves its minimum at the null 
hypothesis. When there is a natural parametrization and a notion of distance 
in the parameter space, the power function is monotone if the power 
increases as the distance between the alternative hypothesis and the null 
hypothesis increases. Note that monotonicity implies unbiasedness. In this 
section we shall show that the power functions of many of the invariant tests 
of the general linear hypothesis are monotone in the invariants of the 
parameters, namely, the roots; these can be considered as measures of 
distance. 

To introduce the approach, we consider the acceptance interval $(-a, a)$ for testing the null hypothesis $\mu = 0$ against the alternative $\mu \ne 0$ on the basis of an observation from $N(\mu, \sigma^2)$. In Figure 8.6 the probabilities of acceptance are represented by the shaded regions for three values of $\mu$. It is clear that the probability of acceptance decreases monotonically (or equivalently the power increases monotonically) as $\mu$ moves away from zero. In fact, this property depends only on the density function being unimodal and symmetric.
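
The one-dimensional statement can be checked directly from the normal distribution function. The following sketch is illustrative; `normal_cdf` and `acceptance_probability` are hypothetical helper names, and the values $a = 1.96$, $\sigma = 1$ are arbitrary choices for the demonstration.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def acceptance_probability(mu, a, sigma=1.0):
    """P(-a < X < a) when X is distributed according to N(mu, sigma^2)."""
    return normal_cdf((a - mu) / sigma) - normal_cdf((-a - mu) / sigma)


# The probability of acceptance falls as mu moves away from zero in either direction.
for mu in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(mu, round(acceptance_probability(mu, a=1.96), 4),
          round(acceptance_probability(-mu, a=1.96), 4))
```

The two printed columns agree for each $\mu$, reflecting the symmetry of the density about its mean.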


Figure 8.6. Three probabilities of acceptance.

Figure 8.7. Acceptance regions.


In higher dimensions we generalize the interval by a symmetric convex set, 
and we ask that the density function be symmetric and unimodal in the sense 
that every contour of constant density surrounds a convex set. In Figure 8.7 
we illustrate that in this case the probability of acceptance decreases mono- 
tonically. The following theorem is due to Anderson (1955b). 

Theorem 8.10.5. Let $E$ be a convex set in $n$-space, symmetric about the origin. Let $f(x) \ge 0$ be a function such that (i) $f(x) = f(-x)$, (ii) $\{x \mid f(x) \ge u\} = K_u$ is convex for every $u$ ($0 < u < \infty$), and (iii) $\int_E f(x)\,dx < \infty$. Then

\[
\int_E f(x + ky)\,dx \ge \int_E f(x + y)\,dx \tag{37}
\]

for $0 \le k \le 1$.

The proof of Theorem 8.10.5 is based on the following lemma. 

Lemma 8.10.12. Let $E$, $F$ be convex and symmetric about the origin. Then

\[
V\{(E + ky) \cap F\} \ge V\{(E + y) \cap F\}, \tag{38}
\]

where $0 \le k \le 1$ and $V$ denotes the $n$-dimensional volume.

Proof. Consider the set $\alpha(E + y) + (1 - \alpha)(E - y) = \alpha E + (1 - \alpha)E + (2\alpha - 1)y$, which consists of points $\alpha(x + y) + (1 - \alpha)(z - y)$ with $x, z \in E$. Let $\alpha_0 = (k + 1)/2$, so that $2\alpha_0 - 1 = k$. Then by convexity of $E$ we have

\[
\alpha_0(E + y) + (1 - \alpha_0)(E - y) \subset E + ky. \tag{39}
\]

Hence by convexity of $F$

\[
\alpha_0[(E + y) \cap F] + (1 - \alpha_0)[(E - y) \cap F] \subset (E + ky) \cap F
\]

and

\[
V\{\alpha_0[(E + y) \cap F] + (1 - \alpha_0)[(E - y) \cap F]\} \le V\{(E + ky) \cap F\}. \tag{40}
\]

Now by the Brunn-Minkowski inequality [e.g., Bonnesen and Fenchel (1948), Section 48], we have

\[
\begin{aligned}
V^{1/n}\{\alpha_0[(E + y) \cap F] &+ (1 - \alpha_0)[(E - y) \cap F]\} \\
&\ge \alpha_0 V^{1/n}\{(E + y) \cap F\} + (1 - \alpha_0)V^{1/n}\{(E - y) \cap F\} \\
&= \alpha_0 V^{1/n}\{(E + y) \cap F\} + (1 - \alpha_0)V^{1/n}\{(-E + y) \cap (-F)\} \\
&= V^{1/n}\{(E + y) \cap F\}.
\end{aligned} \tag{41}
\]

The last equality follows from the symmetry of $E$ and $F$. ∎


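Proof of Theorem 8.10.5. Let

\[
H(u) = V\{(E + ky) \cap K_u\}, \tag{42}
\]

\[
H^*(u) = V\{(E + y) \cap K_u\}. \tag{43}
\]

Then

\[
\begin{aligned}
\int_E f(x + y)\,dx &= \int_{E+y} f(x)\,dx \\
&= \int_{E+y}\int_0^\infty 1_{\{0 \le u \le f(x)\}}\,du\,dx \\
&= \int_0^\infty\int_{E+y} 1_{\{0 \le u \le f(x)\}}\,dx\,du \\
&= \int_0^\infty H^*(u)\,du.
\end{aligned} \tag{44}
\]

Similarly,

\[
\int_E f(x + ky)\,dx = \int_0^\infty H(u)\,du. \tag{45}
\]

By Lemma 8.10.12, $H(u) \ge H^*(u)$. Hence Theorem 8.10.5 follows from (44) and (45). ∎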
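We start with the canonical form given in Section 8.10.1. We further simplify the problem as follows. Let $t = \min(m, p)$, and let $\nu_1, \dots, \nu_t$ ($\nu_1 \ge \nu_2 \ge \cdots \ge \nu_t$) be the nonzero characteristic roots of $\Xi'\Sigma^{-1}\Xi$, where $\Xi = \mathscr{E}X$.


Lemma 8.10.13. There exist matrices $B$ ($p \times p$) and $F$ ($m \times m$) such that

\[
B\Sigma B' = I_p, \qquad FF' = I_m, \qquad B\Xi F' = \begin{cases} (D_\nu^{1/2}, 0), & p \le m, \\[4pt] \begin{pmatrix} D_\nu^{1/2} \\ 0 \end{pmatrix}, & p > m, \end{cases} \tag{46}
\]

where $D_\nu = \mathrm{diag}(\nu_1, \dots, \nu_t)$.


Proof. We prove this for the case $p \le m$ and $\nu_p > 0$. Other cases can be proved similarly. By Theorem A.2.2 of the Appendix there is a matrix $B$ such that

\[
B\Sigma B' = I, \qquad B\Xi\Xi'B' = D_\nu. \tag{47}
\]

Let

\[
F_1 = D_\nu^{-1/2}B\Xi \quad (p \times m). \tag{48}
\]

Then

\[
F_1F_1' = I_p. \tag{49}
\]

Let $F' = (F_1', F_2')$ be a full $m \times m$ orthogonal matrix. Then

\[
B\Xi F_2' = D_\nu^{1/2}F_1F_2' = 0 \tag{50}
\]

and

\[
B\Xi F' = B\Xi(F_1', F_2') = B\Xi(\Xi'B'D_\nu^{-1/2}, F_2') = (D_\nu^{1/2}, 0). \qquad ∎ \tag{51}
\]

Now let

\[
U = BXF', \qquad V = BZ. \tag{52}
\]

Then the columns of $U$, $V$ are independently normally distributed with covariance matrix $I$ and means when $p \le m$

\[
\mathscr{E}U = (D_\nu^{1/2}, 0), \qquad \mathscr{E}V = 0. \tag{53}
\]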
Invariant tests are given in terms of characteristic roots $l_1, \dots, l_t$ ($l_1 \ge \cdots \ge l_t$) of $U'(VV')^{-1}U$. Note that for the admissibility we used the characteristic roots $\lambda_i$ of $U'(UU' + VV')^{-1}U$ rather than $l_i = \lambda_i/(1 - \lambda_i)$. Here it is more natural to use $l_i$, which corresponds to the parameter value $\nu_i$. The following theorem is given by Das Gupta, Anderson, and Mudholkar (1964).


Theorem 8.10.6. If the acceptance region of an invariant test is convex in the space of each column vector of $U$ for each set of fixed values of $V$ and of the other column vectors of $U$, then the power of the test increases monotonically in each $\nu_i$.


Proof. Since $UU'$ is unchanged when any column vector of $U$ is multiplied by $-1$, the acceptance region is symmetric about the origin in each of the column vectors of $U$. Now the density of $U = (u_{ij})$, $V = (v_{ij})$ is

\[
f(U, V) = (2\pi)^{-(n+m)p/2}\exp\Bigl[-\tfrac{1}{2}\Bigl\{\mathrm{tr}\,VV' + \sum_{i=1}^{t}(u_{ii} - \sqrt{\nu_i})^2 + \sum_{i=1}^{p}\sum_{\substack{j=1 \\ j \ne i}}^{m} u_{ij}^2\Bigr\}\Bigr]. \tag{54}
\]

Applying Theorem 8.10.5 to (54), we see that the power increases monotonically in each $\sqrt{\nu_i}$. ∎


Since the section of a convex set is convex, we have the following corollary. 


Corollary 8.10.3. If the acceptance region $A$ of an invariant test is convex in $U$ for each fixed $V$, then the power of the test increases monotonically in each $\nu_i$.


From this we see that Roy's maximum root test $A: l_1 \le K$ and the Lawley-Hotelling trace test $A: \mathrm{tr}\,U'(VV')^{-1}U \le K$ have power functions that are monotonically increasing in each $\nu_i$.


To see that the acceptance region of the likelihood ratio test

\[
A: \prod_{i=1}^{t}(1 + l_i) \le K \tag{55}
\]

satisfies the condition of Theorem 8.10.6 let

\[
(VV')^{-1} = T'T, \quad T: p \times p, \qquad U^* = (u_1^*, \dots, u_m^*) = TU. \tag{56}
\]

Then

\[
\begin{aligned}
\prod_{i=1}^{t}(1 + l_i) &= |U'(VV')^{-1}U + I| = |U^{*\prime}U^* + I| \\
&= |U^*U^{*\prime} + I| = |u_1^*u_1^{*\prime} + B| \\
&= (u_1^{*\prime}B^{-1}u_1^* + 1)|B| \\
&= (u_1'T'B^{-1}Tu_1 + 1)|B|,
\end{aligned} \tag{57}
\]

where $B = u_2^*u_2^{*\prime} + \cdots + u_m^*u_m^{*\prime} + I$. Since $T'B^{-1}T$ is positive definite, (55) is convex in $u_1$. Therefore, the likelihood ratio test has a power function which is monotone increasing in each $\nu_i$.


The Bartlett-Nanda-Pillai trace test

\[
A: \mathrm{tr}\,U'(UU' + VV')^{-1}U = \sum_{i=1}^{t}\frac{l_i}{1 + l_i} \le K \tag{58}
\]

has an acceptance region that is an ellipsoid if $K < 1$ and is convex in each column $u_i$ of $U$ provided $K \le 1$. (See Problem 8.36.) For $K > 1$ (58) may not be convex in each column of $U$. The reader can work out an example for $p = 2$.

Eaton and Perlman (1974) have shown that if an invariant test is convex in $U$ and $W = VV'$, then the power at $(\nu_1^0, \dots, \nu_t^0)$ is greater than at $(\nu_1, \dots, \nu_t)$ if $(\sqrt{\nu_1}, \dots, \sqrt{\nu_t}) \prec_w (\sqrt{\nu_1^0}, \dots, \sqrt{\nu_t^0})$. We shall not prove this result. Roy's maximum root test and the Lawley-Hotelling trace test satisfy the condition, but the likelihood ratio and the Bartlett-Nanda-Pillai trace test do not.

Takemura has shown that if the acceptance region is convex in $U$ and $W$, the set of $\sqrt{\nu_1}, \dots, \sqrt{\nu_t}$ for which the power is not greater than a constant is monotone and convex.

It is enlightening to consider the contours of the power function $\Pi(\sqrt{\nu_1}, \dots, \sqrt{\nu_t})$. Theorem 8.10.6 does not exclude case (a) of Figure 8.8, and similarly the Eaton-Perlman result does not exclude (b). The last result guarantees that the contour looks like (c) for Roy's maximum root test and the Lawley-Hotelling trace test. These results relate to the fact that these two tests are more likely to detect alternative hypotheses where few $\nu_i$'s are far from zero. In contrast with this, the likelihood ratio test and the Bartlett-Nanda-Pillai trace test are sensitive to the overall departure from the null hypothesis. It might be noted that the convexity in $\sqrt{\nu}$-space cannot be translated into the convexity in $\nu$-space.

Figure 8.8. Contours of power functions. [Panels (a), (b), (c); axes $\sqrt{\nu_1}$ and $\sqrt{\nu_2}$.]

By using the noncentral density of the $l_i$'s, which depends on the parameter values $\nu_1, \dots, \nu_t$, Perlman and Olkin (1980) showed that any invariant test with monotone acceptance region (in the space of roots) is unbiased. Note that this result covers all the standard tests considered earlier.


8.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


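8.11.1. Observations Elliptically Contoured


The regression model of Section 8.2 can be written

\[
x_\alpha = Bz_\alpha + e_\alpha, \qquad \alpha = 1, \dots, N, \tag{1}
\]

where $e_\alpha$ is an unobserved disturbance with $\mathscr{E}e_\alpha = 0$ and $\mathscr{E}e_\alpha e_\alpha' = \Sigma$. We assume that $e_\alpha$ has a density $|\Lambda|^{-1/2}g(e'\Lambda^{-1}e)$; then $\Sigma = (\mathscr{E}R^2/p)\Lambda$, where $R^2 = e_\alpha'\Lambda^{-1}e_\alpha$. In general the exact distribution of $\hat{B} = \sum_{\alpha=1}^{N} x_\alpha z_\alpha' A^{-1}$ and $N\hat{\Sigma} = \sum_{\alpha=1}^{N}(x_\alpha - \hat{B}z_\alpha)(x_\alpha - \hat{B}z_\alpha)'$ is difficult to obtain and cannot be expressed concisely. However, the expected value of $\hat{B}$ is $B$, and the covariance matrix of $\mathrm{vec}\,\hat{B}$ is $\Sigma \otimes A^{-1}$ with $A = \sum_{\alpha=1}^{N} z_\alpha z_\alpha'$. We can develop a large-sample distribution for $\hat{B}$ and $N\hat{\Sigma}$.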


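Theorem 8.11.1. Suppose $(1/N)A \to A_0$, $z_\alpha'z_\alpha <$ constant, $\alpha = 1, 2, \dots$, and either the $e_\alpha$'s are independent identically distributed or the $e_\alpha$'s are independent with $\mathscr{E}|e_\alpha'e_\alpha|^{2+\varepsilon} <$ constant for some $\varepsilon > 0$. Then $\hat{B} \xrightarrow{p} B$ and $\sqrt{N}\,\mathrm{vec}(\hat{B} - B)$ has a limiting normal distribution with mean 0 and covariance matrix $\Sigma \otimes A_0^{-1}$.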


Theorem 8.11.1 appears in Anderson (1971) as Theorem 5.5.13. There are many alternatives to its assumptions in the literature. Under its assumptions $\hat{\Sigma}_N \xrightarrow{p} \Sigma$. This result permits a large-sample theory for the criteria for testing null hypotheses about $B$.

Consider testing the null hypothesis

\[
H: B = B^*, \tag{2}
\]

where $B^*$ is completely specified. In Section 8.3 a more general hypothesis was considered for $B$ partitioned as $B = (B_1, B_2)$. However, as shown in that section by the transformation (4), the hypothesis $B_1 = B_1^*$ can be reduced to a hypothesis of the form (2) above.


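Let

\[
G = \sum_{\alpha=1}^{N}(x_\alpha - \hat{B}z_\alpha)(x_\alpha - \hat{B}z_\alpha)' = N\hat{\Sigma}_\Omega, \tag{3}
\]

\[
H = (\hat{B} - B)A(\hat{B} - B)'. \tag{4}
\]

Lemma 8.11.1. Under the conditions of Theorem 8.11.1 the limiting distribution of $H$ is $W(\Sigma, q)$.


Proof. Write $H$ as

\[
H = \sqrt{N}(\hat{B} - B)\,\frac{1}{N}A\,\sqrt{N}(\hat{B} - B)'. \tag{5}
\]

Then the lemma follows from Theorem 8.11.1 and (4) of Section 8.4. ∎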


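We can express the likelihood ratio criterion in the form

\[
\begin{aligned}
-2\log\lambda = -N\log U &= N\log|I + G^{-1}H| \\
&= N\log\Bigl|I + \frac{1}{N}\Bigl(\frac{1}{N}G\Bigr)^{-1}H\Bigr|.
\end{aligned} \tag{6}
\]

Theorem 8.11.2. Under the conditions of Theorem 8.11.1, when the null hypothesis is true,

\[
-2\log\lambda \xrightarrow{d} \chi^2_{pq}. \tag{7}
\]

Proof. We use the fact that $N\log|I + N^{-1}C| = \mathrm{tr}\,C + O(N^{-1})$ when $N \to \infty$, since $|I + xC| = 1 + x\,\mathrm{tr}\,C + O(x^2)$ (Theorem A.4.8).

We have

\[
\begin{aligned}
\mathrm{tr}\Bigl(\frac{1}{N}G\Bigr)^{-1}H &= N\sum_{i,j=1}^{p}\sum_{g,h=1}^{q} g^{ij}(\hat{b}_{ig} - \beta_{ig})a_{gh}(\hat{b}_{jh} - \beta_{jh}) \\
&= [\sqrt{N}\,\mathrm{vec}(\hat{B}' - B')]'\Bigl[\Bigl(\frac{1}{N}G\Bigr)^{-1} \otimes \frac{1}{N}A\Bigr][\sqrt{N}\,\mathrm{vec}(\hat{B}' - B')] \xrightarrow{d} \chi^2_{pq},
\end{aligned} \tag{8}
\]

because $(1/N)G \xrightarrow{p} \Sigma$, $(1/N)A \to A_0$, and the limiting distribution of $\sqrt{N}\,\mathrm{vec}(\hat{B}' - B')$ is $N(0, \Sigma \otimes A_0^{-1})$. ∎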
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Theorem 8.11.2 agrees with the first term of the asymptotic expansion of 
~2log А given by Theorem 8.5.2 for sampling from a normal distribution. 
The test and confidence: procedures discussed in Sections 8.3 and 84 can be 
applied using this x ?-distribution. 
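
As a rough illustration of Theorem 8.11.2, the following sketch computes $-2\log\lambda = N\log|I + G^{-1}H|$ from (3), (4), and (6) on simulated data. The function name `lr_statistic`, the uniform error distribution, and all dimensions are assumptions made only for this example; the statistic is to be compared with a $\chi^2$ variable with $pq$ degrees of freedom.

```python
import numpy as np

def lr_statistic(X, Z, B0):
    """Return -2 log(lambda) = N log|I + G^{-1} H| for H: B = B0.

    X is p x N (dependent variables), Z is q x N (independent variables).
    """
    p, N = X.shape
    A = Z @ Z.T                                # A = ZZ'
    B_hat = X @ Z.T @ np.linalg.inv(A)         # least squares estimator
    resid = X - B_hat @ Z
    G = resid @ resid.T                        # error sums of squares and products, (3)
    H = (B_hat - B0) @ A @ (B_hat - B0).T      # hypothesis matrix, (4) with B = B0
    _, logdet = np.linalg.slogdet(np.eye(p) + np.linalg.solve(G, H))
    return N * logdet

rng = np.random.default_rng(0)
p, q, N = 2, 3, 500
B = rng.normal(size=(p, q))
Z = rng.normal(size=(q, N))
E = rng.uniform(-1.0, 1.0, size=(p, N))        # non-normal errors (illustrative choice)
X = B @ Z + E
stat = lr_statistic(X, Z, B)                   # null hypothesis true: B0 = B
print(stat, "to be compared with chi-square on", p * q, "degrees of freedom")
```

The use of `slogdet` rather than `det` is only a numerical convenience; it avoids overflow when $N$ or $p$ is large.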

The criterion $U = \lambda^{2/N}$ can be written as $U = \prod_{i=1}^{p} V_i$, where $V_i$ is defined
in (8) of Section 8.4. The term $V_i$ has the form of $U$; that is, it is the ratio of
the sum of squares of residuals of $x_{i\alpha}$ regressed on $x_{1\alpha},\ldots,x_{i-1,\alpha}, z_\alpha$ to the
sum regressed on $x_{1\alpha},\ldots,x_{i-1,\alpha}$. It follows that under the null hypothesis
$V_1,\ldots,V_p$ are asymptotically independent and $-N\log V_i \xrightarrow{d} \chi^2_q$. Thus
$-N\log U = -N\sum_{i=1}^{p}\log V_i \xrightarrow{d} \chi^2_{pq}$. This argument justifies the step-down
procedure asymptotically.

Section 8.6 gave several other criteria for the general linear hypothesis:
the Lawley-Hotelling trace $\operatorname{tr} HG^{-1}$, the Bartlett-Nanda-Pillai trace
$\operatorname{tr} H(G + H)^{-1}$, and the Roy maximum root of $HG^{-1}$ or $H(G + H)^{-1}$. The limiting
distributions of $N\operatorname{tr} HG^{-1}$ and $N\operatorname{tr} H(G + H)^{-1}$ are again $\chi^2_{pq}$. The limiting
distribution of the maximum characteristic root of $NHG^{-1}$ or $NH(G + H)^{-1}$
is the distribution of the maximum characteristic root of $H$ having the
distribution $W(I, q)$ (Lemma 8.11.1). Significance points for these test
criteria are available in Appendix B.
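
The three large-sample criteria just listed can be computed directly from $G$ and $H$. The sketch below is a minimal illustration; the helper name `large_sample_criteria` and the toy matrices are assumptions, not part of the text.

```python
import numpy as np

def large_sample_criteria(G, H, N):
    """Return N tr HG^{-1}, N tr H(G+H)^{-1}, and the largest root of N HG^{-1}."""
    HG_inv = H @ np.linalg.inv(G)
    lawley_hotelling = N * np.trace(HG_inv)
    pillai = N * np.trace(H @ np.linalg.inv(G + H))
    roots = np.linalg.eigvals(HG_inv).real     # roots are real and nonnegative in exact arithmetic
    roy = N * roots.max()
    return lawley_hotelling, pillai, roy

# Toy matrices chosen only so the arithmetic is easy to follow.
G = 100.0 * np.eye(2)
H = np.diag([2.0, 1.0])
lh, bnp, roy = large_sample_criteria(G, H, N=100)
print(lh, bnp, roy)
```

In this toy case the Bartlett-Nanda-Pillai value is smaller than the Lawley-Hotelling value, in line with the ordering stated in Problem 8.19.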


8.11.2. Elliptically Contoured Matrix Distributions 


In Section 8.3.2 the $p \times N$ matrix of observations on the dependent variable
was defined as $X = (x_1,\ldots,x_N)$, and the $q \times N$ matrix of observations on the
independent variables as $Z = (z_1,\ldots,z_N)$; the two matrices are related by
$\mathcal{E}X = BZ$. Note that in this chapter the matrices of observations have $N$
columns instead of $N$ rows.

Let $E = (e_1,\ldots,e_N)$ be a $p \times N$ random matrix with density
$|\Lambda|^{-N/2}\,g[F^{-1}EE'(F')^{-1}]$, where $\Lambda = FF'$. Define $X$ by


(9)    $X = BZ + E.$

In these terms the least squares estimator of $B$ is

(10)    $\hat{B} = XZ'(ZZ')^{-1} = CA^{-1},$

where $C = XZ' = \sum_{\alpha=1}^{N} x_\alpha z_\alpha'$ and $A = ZZ' = \sum_{\alpha=1}^{N} z_\alpha z_\alpha'$. Note that the density
of $E$ is invariant with respect to multiplication on the right by $N \times N$
orthogonal matrices; that is, $E'$ is left spherical. Then $E'$ has the stochastic
representation


(11)    $E' \overset{d}{=} UTF',$

where $U$ has the uniform distribution on $U'U = I_p$, $T$ is the lower triangular
matrix with nonnegative diagonal elements satisfying $EE' = FT'TF'$, and $F$
is a lower triangular matrix with nonnegative diagonal elements satisfying
$FF' = \Lambda$. We can write


(12)    $\hat{B} - B = EZ'A^{-1} \overset{d}{=} FT'U'Z'A^{-1},$

(13)    $H = (\hat{B} - B)A(\hat{B} - B)' = EZ'A^{-1}ZE' \overset{d}{=} FT'U'(Z'A^{-1}Z)UTF',$

(14)    $G = (X - BZ)(X - BZ)' - H = EE' - H$
        $= E(I_N - Z'A^{-1}Z)E' \overset{d}{=} FT'U'(I_N - Z'A^{-1}Z)UTF'.$
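
The algebraic parts of (12), (13), and (14) hold for any error matrix $E$ and can be checked numerically. The sketch below does so on simulated data; the dimensions, the seed, and the Student $t$ errors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, N = 2, 3, 20
B = rng.normal(size=(p, q))
Z = rng.normal(size=(q, N))
E = rng.standard_t(df=5, size=(p, N))    # any error matrix works for these identities
X = B @ Z + E                            # model (9)

A = Z @ Z.T
A_inv = np.linalg.inv(A)
B_hat = X @ Z.T @ A_inv                  # estimator (10)
P = Z.T @ A_inv @ Z                      # N x N matrix Z'A^{-1}Z

H = (B_hat - B) @ A @ (B_hat - B).T
G = (X - B_hat @ Z) @ (X - B_hat @ Z).T  # residual sums of squares and products

ok12 = np.allclose(B_hat - B, E @ Z.T @ A_inv)
ok13 = np.allclose(H, E @ P @ E.T)
ok14 = np.allclose(G, E @ (np.eye(N) - P) @ E.T)
print(ok12, ok13, ok14)
```

The matrix `P` is idempotent of rank $q$, which is the property used later in the section to diagonalize $Z'A^{-1}Z$.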


It was shown in Section 8.6 that the likelihood ratio criterion for $H: B = 0$,
the Lawley-Hotelling trace criterion, the Bartlett-Nanda-Pillai trace criterion,
and the Roy maximum root test are invariant with respect to linear
transformations $x \to Kx$. Then Corollary 4.5.5 implies the following theorem.


Theorem 8.11.3. Under the null hypothesis $B = 0$, the distribution of each
invariant criterion when the distribution of $E'$ is left spherical is the same as the
distribution under normality.
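
The invariance under $x \to Kx$ that underlies Theorem 8.11.3 can be illustrated numerically. In the sketch below the function `criteria`, the matrix `K`, and the simulated data are assumptions for the example; for $H: B = 0$ the hypothesis matrix reduces to $H = \hat{B}A\hat{B}' = XZ'A^{-1}ZX'$.

```python
import numpy as np

def criteria(X, Z):
    """U, tr HG^{-1}, and tr H(G+H)^{-1} for H: B = 0, with X p x N and Z q x N."""
    A_inv = np.linalg.inv(Z @ Z.T)
    P = Z.T @ A_inv @ Z
    H = X @ P @ X.T                                  # B_hat A B_hat'
    G = X @ (np.eye(X.shape[1]) - P) @ X.T
    U = np.linalg.det(G) / np.linalg.det(G + H)
    lawley = np.trace(H @ np.linalg.inv(G))
    pillai = np.trace(H @ np.linalg.inv(G + H))
    return np.array([U, lawley, pillai])

rng = np.random.default_rng(2)
p, q, N = 3, 2, 15
Z = rng.normal(size=(q, N))
X = rng.normal(size=(p, N))
K = rng.normal(size=(p, p)) + 3.0 * np.eye(p)        # assumed nonsingular for this seed
same = np.allclose(criteria(X, Z), criteria(K @ X, Z))
print(same)
```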


Thus the tests and confidence regions described in Section 8.7 are valid
for left-spherical distributions $E'$.

The matrices $Z'A^{-1}Z$ and $I_N - Z'A^{-1}Z$ are idempotent of ranks $q$ and
$N - q$. There is an orthogonal matrix $O_N$ such that


(15)    $O_N Z'A^{-1}Z\,O_N' = \begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}, \qquad O_N(I_N - Z'A^{-1}Z)O_N' = \begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}.$

The transformation $V = O_N U$ is uniformly distributed on $V'V = I_p$, and

(16)    $H = KV'\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}VK', \qquad G = KV'\begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}VK',$

where $K = FT'$.
The trace criterion $\operatorname{tr} HG^{-1}$, for example, is

(17)    $\operatorname{tr} HG^{-1} = \operatorname{tr}\left\{V'\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}V\left[V'\begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}V\right]^{-1}\right\}.$


The distribution of any invariant criterion depends only on $U$ (or $V$), not
on $T$.

Since $G + H = FT'TF'$, it is independent of $U$. A selection of a linear
transformation of $X$ can be made on the basis of $G + H$. Let $D$ be a $p \times r$
matrix of rank $r$ that may depend on $G + H$. Define $x_\alpha^* = D'x_\alpha$. Then
$\mathcal{E}x_\alpha^* = (D'B)z_\alpha$, and the hypothesis $B = 0$ implies $D'B = 0$. Let $X^* =
(x_1^*,\ldots,x_N^*) = D'X$, $B_D = D'B$, $E_D = D'E$, $H_D = D'HD$, $G_D = D'GD$. Then
$E_D' = E'D \overset{d}{=} UTF'D$. The invariant test criteria for $B_D = 0$ are those for
$B = 0$ and have the same distributions under the null hypothesis as for the
normal distribution with $p$ replaced by $r$.


PROBLEMS 
8.1. (Sec. 8.2.2) Consider the following sample (for N = 8): 


Weight of grain 40 17 9 15 6 12 5 9 
Weight of straw 53 19 10 29 13 27 19 30 
Amount of fertilizer 24 11 5 12 7 14 11 18 


Let $z_{2\alpha} = 1$, and let $z_{1\alpha}$ be the amount of fertilizer on the $\alpha$th plot. Estimate $B$
for this sample. Test the hypothesis $B_1 = 0$ at the 0.01 significance level.


8.2. (Sec. 8.2) Show that Theorem 3.2.1 is a special case of Theorem 8.2.1. 
[Hint: Let $q = 1$, $z_\alpha = 1$, $B = \mu$.]


8.3. (Sec. 8.2) Prove Theorem 8.2.3. 


8.4. (Sec. 8.2) Show that $\hat{B}$ minimizes the generalized variance


$\left|\sum_{\alpha=1}^{N} (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)'\right|.$


8.5. (Sec. 8.3) In the following data [Woltz, Reid, and Colwell (1948), used by
R. L. Anderson and Bancroft (1952)] the variables are $x_1$, rate of cigarette burn;
$x_2$, the percentage of nicotine; $z_1$, the percentage of nitrogen; $z_2$, of chlorine;
$z_3$, of potassium; $z_4$, of phosphorus; $z_5$, of calcium; and $z_6$, of magnesium; and
$z_7 = 1$; and $N = 25$:


53.92 


N 
А 2v _ [0.6690 0.4527 
У (х, X), 73) = (0560 o5]. 


α=1

$\sum_{\alpha=1}^{N} (z_\alpha - \bar{z})(z_\alpha - \bar{z})' = \begin{pmatrix}
 1.8311 & -0.3589 & -0.0125 & -0.0244 &  1.6379 &  0.5057 & 0 \\
-0.3589 &  8.8102 & -0.3469 &  0.0352 &  0.7920 &  0.2173 & 0 \\
-0.0125 & -0.3469 &  1.5818 & -0.0415 & -1.4278 & -0.4753 & 0 \\
-0.0244 &  0.0352 & -0.0415 &  0.0258 &  0.0043 &  0.0154 & 0 \\
 1.6379 &  0.7920 & -1.4278 &  0.0043 &  3.7248 &  0.9120 & 0 \\
 0.5057 &  0.2173 & -0.4753 &  0.0154 &  0.9120 &  0.3828 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},$


$\sum_{\alpha=1}^{N} (z_\alpha - \bar{z})(x_\alpha - \bar{x})' = \begin{pmatrix}
 0.2501 &  2.6691 \\
-1.5136 & -2.0617 \\
 0.5007 & -0.9503 \\
-0.0401 & -0.0187 \\
-0.1914 &  3.4020 \\
-0.1586 &  1.1663 \\
 0 & 0
\end{pmatrix}.$


(a) Estimate the regression of $x_1$ and $x_2$ on $z_1$, $z_5$, $z_6$, and $z_7$.
(b) Estimate the regression on all seven variables.
(c) Test the hypothesis that the regression on $z_2$, $z_3$, and $z_4$ is 0.


8.6. (Sec. 8.3) Let $q = 2$, $z_{1\alpha} = w_\alpha$ (scalar), $z_{2\alpha} = 1$. Show that the $U$-statistic for
testing the hypothesis $B_1 = 0$ is a monotonic function of a $T^2$-statistic, and give
the $T^2$-statistic in a simple form. (See Problem 5.1.)
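8.7. (Sec. 8.3) Let $z_{q\alpha} = 1$, let $q_2 = 1$, and let

$A^* = \left(\sum_{\alpha} (z_{i\alpha} - \bar{z}_i)(z_{j\alpha} - \bar{z}_j)\right), \qquad i, j = 1,\ldots,q_1 = q - 1.$


Prove that


$(\hat{B}_{1\Omega} - B_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat{B}_{1\Omega} - B_1^*)' = (\hat{B}_{1\Omega} - B_1^*)A^*(\hat{B}_{1\Omega} - B_1^*)'.$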


8.8. (Sec. 8.3) Let $q_1 = q_2$. How do you test the hypothesis $B_1 = B_2$?


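8.9. (Sec. 8.3) Prove


$\hat{B}_{1\Omega} = \sum_{\alpha} x_\alpha\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\right)'\left[\sum_{\alpha}\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\right)\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\right)'\right]^{-1}$

$= (C_1 - C_2A_{22}^{-1}A_{21})(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}.$

8.10. (Sec. 8.4) By comparing Theorem 8.2.2 and Problem 8.9, prove Lemma 8.4.1.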
8.11. (Sec. 8.4) Prove Lemma 8.4.1 by showing that the density of Bio and В, is 
К; exp[ - itr X^ (Bio - Br) 4i(Bis - BY)’] | 
Ka exp —3tr > (Ê, -= B:) 42 (B;., -= g.)] . 
812. (Sec. 8.4) Show that the cdf of Из is 


(таз) + I * 2T[2( +D] 
T(n - 1)T (5n - 1) 
(аиа Whee 


am- + р [aresin(2u — 1) – т] 


" vu 3(n +1) 


gion {0 <z <1,0 <z <1,z2z < . 
<u} and (0 <z; <u/z,, usns EY 122€ u] is the 


+21 ы =H E et 
7 | 


[ Hint: Use Theorem 8.4.4. The re 
union of (0 <z; < 1,0 <z, 


8.13. (Sec. 8.4) Find Pr(U, , , >и). 
8.14, (Sec. 8.4) Find Pr(U, , >и). 
8.15. (Sec. 8.4) For $p \le m$ find $\mathcal{E}U^h$ from the density of $G$ and $H$. [Hint: Use the
fact that the density of $K + \sum_{i=1}^{t} Y_iY_i'$ is $W(\Sigma, s + t)$ if the density of $K$ is
$W(\Sigma, s)$ and $Y_1,\ldots,Y_t$ are independently distributed as $N(0, \Sigma)$.]


8.16. (Sec. 8.4)


(a) Show that when $p$ is even, the characteristic function of $Y = \log U_{p,m,n}$, say
$\phi(t) = \mathcal{E}e^{itY}$, is the reciprocal of a polynomial.

(b) Sketch a method of inverting the characteristic function of $Y$ by the
method of residues.

(c) Show that the resulting density of $U$ is a polynomial in $\sqrt{u}$ and $\log u$ with
possibly a factor of $u^{-1}$.


8.17. (Sec. 8.5) Use the asymptotic expansion of the distribution to compute
$\Pr\{-k\log U_{3,3,n} \le M^*\}$ for


(a) $n = 8$, $M^* = 14.7$,
(b) $n = 8$, $M^* = 21.7$,
(c) $n = 16$, $M^* = 14.7$,
(d) $n = 16$, $M^* = 21.7$.


(Either compute to the third decimal place or use the expansion to the $k^{-4}$
term.)

8.18. (Sec. 8.5) In case $p = 3$, $q_1 = 4$, and $n = N - q = 20$, find the 50% significance
point for $-k\log U$ (a) using $-2\log\lambda$ as $\chi^2$ and (b) using $-k\log U$ as $\chi^2$. Using
more terms of this expansion, evaluate the exact significance levels for your
answers to (a) and (b).


8.19. (Sec. 8.6.5) Prove for $l_i \ge 0$, $i = 1,\ldots,p$,


$\sum_{i=1}^{p} \dfrac{l_i}{1 + l_i} \le \log\prod_{i=1}^{p}(1 + l_i) \le \sum_{i=1}^{p} l_i.$


Comment: The inequalities imply an ordering of the values of the
Bartlett-Nanda-Pillai trace, the negative logarithm of the likelihood ratio
criterion, and the Lawley-Hotelling trace.


8.20. (Sec. 8.6) The multivariate beta density. Let $H$ and $G$ be independently
distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. Let $C$ be a matrix
such that $CC' = H + G$, and let


$L = C^{-1}H(C')^{-1}.$
Show that the density of $L$ is


$\dfrac{\Gamma_p[\tfrac{1}{2}(m + n)]}{\Gamma_p(\tfrac{1}{2}m)\,\Gamma_p(\tfrac{1}{2}n)}\,|L|^{\frac{1}{2}(m-p-1)}\,|I - L|^{\frac{1}{2}(n-p-1)}$


for $L$ and $I - L$ positive definite, and 0 otherwise.


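8.21. (Sec. 8.9) Let $Y_{ij}$ (a $p$-component vector) be distributed according to $N(\mu_{ij}, \Sigma)$,
where $\mathcal{E}Y_{ij} = \mu_{ij} = \mu + \lambda_i + \nu_j + \gamma_{ij}$, $\sum_i \lambda_i = 0 = \sum_j \nu_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij}$; the $\gamma_{ij}$
are the interactions. If $m$ observations are made on each $Y_{ij}$ (say $y_{ij1},\ldots,y_{ijm}$),
how do you test the hypothesis $\lambda_i = 0$, $i = 1,\ldots,r$? How do you test the
hypothesis $\gamma_{ij} = 0$, $i = 1,\ldots,r$, $j = 1,\ldots,c$?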


8.22. (Sec. 8.9) The Latin square. Let $Y_{ij}$, $i, j = 1,\ldots,r$, be distributed according to
$N(\mu_{ij}, \Sigma)$, where $\mathcal{E}Y_{ij} = \mu_{ij} = \gamma + \lambda_i + \nu_j + \mu_k$ and $k = j - i + 1 \pmod{r}$ with
$\sum \lambda_i = \sum \nu_j = \sum \mu_k = 0$.


(a) Give the univariate analysis of variance table for main effects and error
(including sums of squares, numbers of degrees of freedom, and mean
squares).

(b) Give the table for the vector case.

(c) Indicate in the vector case how to test the hypothesis $\lambda_i = 0$, $i = 1,\ldots,r$.


8.23. (Sec. 8.9) Let $x_1$ be the yield of a process and $x_2$ a quality measure. Let
$z_1 = 1$, $z_2 = \pm 10°$ (temperature relative to average), $z_3 = \pm 0.75$ (relative
measure of flow of one agent), and $z_4 = \pm 1.50$ (relative measure of flow of another
agent). [See Anderson (1955a) for details.] Three observations were made on $x_1$
and $x_2$ for each possible triplet of values of $z_2$, $z_3$, and $z_4$. The estimate of $B$ is


$\hat{B} = \begin{pmatrix} 58.529 & -0.3829 & -5.050 & 2.308 \\ 98.675 & 0.1558 & 4.144 & -0.700 \end{pmatrix};$


$s_1 = 3.090$, $s_2 = 1.619$, and $r = -0.6632$ can be used to compute $S$ or $\hat{\Sigma}$.


(a) Formulate an analysis of variance model for this situation.

(b) Find a confidence region for the effects of temperature (i.e., $\beta_{12}$, $\beta_{22}$).

(c) Test the hypothesis that the two agents have no effect on the yield and
quality.


8.24. (Sec. 8.6) Interpret the transformations referred to in Theorem 8.6.1 in the
original terms; that is, $H: B_1 = B_1^*$ and $z_\alpha$.


8.25. (Sec. 8.6) Find the cdf of $\operatorname{tr} HG^{-1}$ for $p = 2$. [Hint: Use the distribution of the
roots given in Chapter 13.]


8.26. (Sec. 8.10.1) Bartlett-Nanda-Pillai V-test as a Bayes procedure. Let
$w_1,\ldots,w_{m+n}$ be independently normally distributed with covariance matrix
$\Sigma$ and means $\mathcal{E}w_i = \gamma_i$, $i = 1,\ldots,m$, $\mathcal{E}w_i = 0$, $i = m+1,\ldots,m+n$. Let $\Pi_0$ be
defined by $[\Gamma_1, \Sigma] = [0, (I + CC')^{-1}]$, where the $p \times m$ matrix $C$ has a density
proportional to $|I + CC'|^{-\frac{1}{2}(n+m)}$, and $\Gamma_1 = (\gamma_1,\ldots,\gamma_m)$; let $\Pi_1$ be defined by


$[\Gamma_1, \Sigma] = [(I + CC')^{-1}C, (I + CC')^{-1}]$, where $C$ has a density proportional to
$|I + CC'|^{-\frac{1}{2}(n+m)}\exp[\tfrac{1}{2}\operatorname{tr} C'(I + CC')^{-1}C]$.


(a) Show that the measures are finite for $n \ge p$ by showing $\operatorname{tr} C'(I + CC')^{-1}C
\le m$ and verifying that the integral of $|I + CC'|^{-\frac{1}{2}(n+m)}$ is finite. [Hint: Let
$C = (c_1,\ldots,c_m)$, $D_j = I + \sum_{i=1}^{j} c_ic_i' = E_jE_j'$, $c_j = E_{j-1}d_j$, $j = 1,\ldots,m$ ($E_0 = I$).
Show $|D_j| = |D_{j-1}|(1 + d_j'd_j)$ and hence $|D_m| = \prod_{j=1}^{m}(1 + d_j'd_j)$. Then refer
to Problem 5.15.]


(b) Show that the inequality (26) of Section 5.6 is equivalent to


$\operatorname{tr}\left(\sum_{i=1}^{m+n} w_iw_i'\right)^{-1}\sum_{i=1}^{m} w_iw_i' \ge k.$


Hence the Bartlett-Nanda-Pillai V-test is Bayes and thus admissible.


8.27. (Sec. 8.10.1) Likelihood ratio test as a Bayes procedure. Let $w_1,\ldots,w_{m+n}$ be
independently normally distributed with covariance matrix $\Sigma$ and means $\mathcal{E}w_i
= \gamma_i$, $i = 1,\ldots,m$, $\mathcal{E}w_i = 0$, $i = m+1,\ldots,m+n$, with $n \ge m + p$. Let $\Pi_0$ be
defined by $[\Gamma_1, \Sigma] = [0, (I + CC')^{-1}]$, where the $p \times m$ matrix $C$ has a density
proportional to $|I + CC'|^{-\frac{1}{2}(n+m)}$ and $\Gamma_1 = (\gamma_1,\ldots,\gamma_m)$; let $\Pi_1$ be defined by


$[\Gamma_1, \Sigma] = [(I + CC')^{-1}CD, (I + CC')^{-1}],$


where the $m$ columns of $D$ are conditionally independently normally distributed
with means 0 and covariance matrix $[I - C'(I + CC')^{-1}C]^{-1}$, and $C$ has (marginal)
density proportional to


$|I + CC'|^{-\frac{1}{2}(n+m)}\,|I - C'(I + CC')^{-1}C|^{-\frac{1}{2}m}.$


(a) Show the measures are finite. [Hint: See Problem 8.26.]
(b) Show that the inequality (26) of Section 5.6 is equivalent to


$\dfrac{\left|\sum_{i=1}^{m+n} w_iw_i'\right|}{\left|\sum_{i=m+1}^{m+n} w_iw_i'\right|} \ge k.$


Hence the likelihood ratio test is Bayes and thus admissible.


8.28. (Sec. 8.10.1) Admissibility of the likelihood ratio test. Show that the acceptance
region $|ZZ'| / |ZZ' + XX'| \ge c$ satisfies the conditions of Theorem 8.10.1. [Hint:
The acceptance region can be written $\prod_{i=1}^{t} m_i \ge c$, where $m_i = 1 - \lambda_i$, $i =
1,\ldots,t$.]


8.29. (Sec. 8.10.1) Admissibility of the Lawley-Hotelling test. Show that the acceptance
region $\operatorname{tr} XX'(ZZ')^{-1} \le c$ satisfies the conditions of Theorem 8.10.1.


8.30. (Sec. 8.10.1) Admissibility of the Bartlett-Nanda-Pillai trace test. Show that the
acceptance region $\operatorname{tr} X'(ZZ' + XX')^{-1}X \le c$ satisfies the conditions of Theorem
8.10.1.


8.31. (Sec. 8.10.1) Show that if $A$ and $B$ are positive definite and $A - B$ is positive
semidefinite, then $B^{-1} - A^{-1}$ is positive semidefinite.


8.32. (Sec. 8.10.1) Show that the boundary of $A$ has $m$-measure 0. [Hint: Show that
(closure of $A$) $\subset A \cup C$, where $C = \{V \mid U - YY'$ is singular$\}$.]


8.33. (Sec. 8.10.1) Show that if $A \subset R^n_{\ge}$ is convex and monotone in majorization,
then $A^*$ is convex. [Hint: Show


$(px + qy)_{\downarrow} \prec_w px_{\downarrow} + qy_{\downarrow},$
where
$x_{\downarrow} = (x_{[1]},\ldots,x_{[n]})' \in R^n_{\ge}$.]


8.34. (Sec. 8.10.1) Show that $C(A)$ is convex. [Hint: Follow the solution of Problem
8.33 to show $(px + qy) \prec_w A$ if $x \prec_w A$ and $y \prec_w A$.]


8.35. (Sec. 8.10.1) Show that if $A$ is monotone, then $A^*$ is monotone. [Hint: Use
the fact that


$x_{[k]} = \max_{i_1 < \cdots < i_k}\{\min(x_{i_1},\ldots,x_{i_k})\}$.]


8.36. (Sec. 8.10.2) Monotonicity of the power function of the Bartlett-Nanda-Pillai
trace test. Show that


$\operatorname{tr}(uu' + B)(uu' + B + W)^{-1} \le K$


is convex in $u$ for fixed positive semidefinite $B$ and positive definite $B + W$ if
$0 \le K \le 1$. [Hint: Verify


$(uu' + B + W)^{-1} = (B + W)^{-1} - \dfrac{1}{1 + u'(B + W)^{-1}u}\,(B + W)^{-1}uu'(B + W)^{-1}.$


The resulting quadratic form in $u$ involves the matrix $(\operatorname{tr} A)I - A$ for $A =
(B + W)^{-\frac{1}{2}}B(B + W)^{-\frac{1}{2}}$; show that this matrix is positive semidefinite by
diagonalizing $A$.]


8.37. (Sec. 8.8) Let $x_\alpha^{(\nu)}$, $\alpha = 1,\ldots,N_\nu$, be observations from $N(\mu^{(\nu)}, \Sigma)$, $\nu = 1,\ldots,q$.
What criterion may be used to test the hypothesis that


$\mu^{(\nu)} = \sum_{h=1}^{m} \gamma_h c_{h\nu} + \mu,$


where $c_{h\nu}$ are given numbers and $\gamma_h$, $\mu$ are unknown vectors? [Note: This
hypothesis (that the means lie on an $m$-dimensional hyperplane with ratios of
distances known) can be put in the form of the general linear hypothesis.]


8.38. (Sec. 8.2) Let $x_\alpha$ be an observation from $N(Bz_\alpha, \Sigma)$, $\alpha = 1,\ldots,N$. Suppose
there is a known fixed vector $\gamma$ such that $B\gamma = 0$. How do you estimate $B$?


8.39. (Sec. 8.8) What is the largest group of transformations on $y_\alpha^{(i)}$, $\alpha = 1,\ldots,N_i$,
$i = 1,\ldots,q$, that leaves (1) invariant? Prove the test (12) is invariant under this
group.

CHAPTER 9 


Testing Independence of 
Sets of Variates 


9.1. INTRODUCTION 


In this section we divide a set of p variates with a joint normal distribution 
into q subsets and ask whether the q subsets are mutually independent; this 
is equivalent to testing the hypothesis that each variable in one subset is 
uncorrelated with each variable in the others. We find the likelihood ratio 
criterion for this hypothesis, the moments of the criterion under the null 
hypothesis, some particular distributions, and an asymptotic expansion of the 
distribution. 

The likelihood ratio criterion is invariant under linear transformations 
within sets; another such criterion is developed. Alternative test procedures 
are step-down procedures, which are not invariant, but are flexible. In the 
case of two sets, independence of the two sets is equivalent to the regression 
of one on the other being 0; the criteria for Chapter 8 are available. Some 
optimal properties of the likelihood ratio test are treated. 


9.2. THE LIKELIHOOD RATIO CRITERION FOR TESTING 
INDEPENDENCE OF SETS OF VARIATES 


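Let the $p$-component vector $X$ be distributed according to $N(\mu, \Sigma)$. We
partition $X$ into $q$ subvectors with $p_1, p_2, \ldots, p_q$ components, respectively;
that is,

(1)    $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \\ \vdots \\ X^{(q)} \end{pmatrix}.$


The vector of means $\mu$ and the covariance matrix $\Sigma$ are partitioned similarly,


(2)    $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \\ \vdots \\ \mu^{(q)} \end{pmatrix},$

(3)    $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1q} \\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2q} \\ \vdots & \vdots & & \vdots \\ \Sigma_{q1} & \Sigma_{q2} & \cdots & \Sigma_{qq} \end{pmatrix}.$

The null hypothesis we wish to test is that the subvectors $X^{(1)},\ldots,X^{(q)}$ are
mutually independently distributed, that is, that the density of $X$ factors into
the densities of $X^{(1)},\ldots,X^{(q)}$. It is


(4)    $H: n(x \mid \mu, \Sigma) = \prod_{i=1}^{q} n(x^{(i)} \mid \mu^{(i)}, \Sigma_{ii}).$

If $X^{(1)},\ldots,X^{(q)}$ are independent subvectors,


(5)    $\mathcal{E}(X^{(i)} - \mu^{(i)})(X^{(j)} - \mu^{(j)})' = \Sigma_{ij} = 0, \qquad i \ne j.$


(See Section 2.4.) Conversely, if (5) holds, then (4) is true. Thus the null
hypothesis is equivalently $H: \Sigma_{ij} = 0$, $i \ne j$. This can be stated alternatively as
the hypothesis that $\Sigma$ is of the form


(6)    $\Sigma_0 = \begin{pmatrix} \Sigma_{11} & 0 & \cdots & 0 \\ 0 & \Sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma_{qq} \end{pmatrix}.$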

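Given a sample $x_1,\ldots,x_N$ of $N$ observations on $X$, the likelihood ratio
criterion is


(7)    $\lambda = \dfrac{\max_{\mu,\Sigma_0} L(\mu, \Sigma_0)}{\max_{\mu,\Sigma} L(\mu, \Sigma)},$

where

(8)    $L(\mu, \Sigma) = \prod_{\alpha=1}^{N} \dfrac{1}{(2\pi)^{\frac{1}{2}p}|\Sigma|^{\frac{1}{2}}}\,e^{-\frac{1}{2}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)},$


and $L(\mu, \Sigma_0)$ is $L(\mu, \Sigma)$ with $\Sigma_{ij} = 0$, $i \ne j$, and where the maximum is taken
with respect to all vectors $\mu$ and positive definite $\Sigma$ and $\Sigma_0$ (i.e., $\Sigma_{ii}$). As
derived in Section 5.2, Equation (6),


(9)    $\max_{\mu,\Sigma} L(\mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN}|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}\,e^{-\frac{1}{2}pN},$

where

(10)    $\hat{\Sigma}_\Omega = \dfrac{1}{N}A = \dfrac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'.$

Under the null hypothesis,

(11)    $L(\mu, \Sigma_0) = \prod_{i=1}^{q} L_i(\mu^{(i)}, \Sigma_{ii}),$


where


(12)    $L_i(\mu^{(i)}, \Sigma_{ii}) = \prod_{\alpha=1}^{N} \dfrac{1}{(2\pi)^{\frac{1}{2}p_i}|\Sigma_{ii}|^{\frac{1}{2}}}\,e^{-\frac{1}{2}(x_\alpha^{(i)} - \mu^{(i)})'\Sigma_{ii}^{-1}(x_\alpha^{(i)} - \mu^{(i)})}.$

Clearly

(13)    $\max_{\mu,\Sigma_0} L(\mu, \Sigma_0) = \prod_{i=1}^{q}\max_{\mu^{(i)},\Sigma_{ii}} L_i(\mu^{(i)}, \Sigma_{ii})$

        $= \prod_{i=1}^{q}\dfrac{1}{(2\pi)^{\frac{1}{2}p_iN}|\hat{\Sigma}_{ii\omega}|^{\frac{1}{2}N}}\,e^{-\frac{1}{2}p_iN}$

        $= \dfrac{1}{(2\pi)^{\frac{1}{2}pN}\prod_{i=1}^{q}|\hat{\Sigma}_{ii\omega}|^{\frac{1}{2}N}}\,e^{-\frac{1}{2}pN},$

where

(14)    $\hat{\Sigma}_{ii\omega} = \dfrac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'.$

If we partition $A$ and $\hat{\Sigma}_\Omega$ as we have $\Sigma$,


(15)    $A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1q} \\ A_{21} & A_{22} & \cdots & A_{2q} \\ \vdots & \vdots & & \vdots \\ A_{q1} & A_{q2} & \cdots & A_{qq} \end{pmatrix}, \qquad \hat{\Sigma}_\Omega = \begin{pmatrix} \hat{\Sigma}_{11} & \hat{\Sigma}_{12} & \cdots & \hat{\Sigma}_{1q} \\ \hat{\Sigma}_{21} & \hat{\Sigma}_{22} & \cdots & \hat{\Sigma}_{2q} \\ \vdots & \vdots & & \vdots \\ \hat{\Sigma}_{q1} & \hat{\Sigma}_{q2} & \cdots & \hat{\Sigma}_{qq} \end{pmatrix},$


we see that $\hat{\Sigma}_{ii\omega} = \hat{\Sigma}_{ii} = (1/N)A_{ii}$.
The likelihood ratio criterion is


(16)    $\lambda = \dfrac{\max_{\mu,\Sigma_0} L(\mu, \Sigma_0)}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \dfrac{|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}{\prod_{i=1}^{q}|\hat{\Sigma}_{ii}|^{\frac{1}{2}N}} = \dfrac{|A|^{\frac{1}{2}N}}{\prod_{i=1}^{q}|A_{ii}|^{\frac{1}{2}N}}.$


The critical region of the likelihood ratio test is


(17)    $\lambda \le \lambda(\varepsilon),$


where $\lambda(\varepsilon)$ is a number such that the probability of (17) is $\varepsilon$ with $\Sigma = \Sigma_0$. (It
remains to show that such a number can be found.) Let


(18)    $V = \dfrac{|A|}{\prod_{i=1}^{q}|A_{ii}|}.$


Then $\lambda = V^{\frac{1}{2}N}$ is a monotonic increasing function of $V$. The critical region
(17) can be equivalently written as


(19)    $V \le V(\varepsilon).$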


Theorem 9.2.1. Let $x_1,\ldots,x_N$ be a sample of $N$ observations drawn from
$N(\mu, \Sigma)$, where $x_\alpha$, $\mu$, and $\Sigma$ are partitioned into $p_1,\ldots,p_q$ rows (and columns
in the case of $\Sigma$) as indicated in (1), (2), and (3). The likelihood ratio criterion
that the $q$ sets of components are mutually independent is given by (16), where $A$
is defined by (10) and partitioned according to (15). The likelihood ratio test is
given by (17) and equivalently by (19), where $V$ is defined by (18) and $\lambda(\varepsilon)$ or
$V(\varepsilon)$ is chosen to obtain the significance level $\varepsilon$.


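Since $r_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}}$, we have


(20)    $|A| = |R|\prod_{i=1}^{p} a_{ii},$

where

(21)    $R = (r_{ij}) = \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1q} \\ R_{21} & R_{22} & \cdots & R_{2q} \\ \vdots & \vdots & & \vdots \\ R_{q1} & R_{q2} & \cdots & R_{qq} \end{pmatrix}$

and

(22)    $|A_{ii}| = |R_{ii}|\prod_{j=p_1+\cdots+p_{i-1}+1}^{p_1+\cdots+p_i} a_{jj}.$

Thus

(23)    $V = \dfrac{|A|}{\prod_{i=1}^{q}|A_{ii}|} = \dfrac{|R|}{\prod_{i=1}^{q}|R_{ii}|}.$


That is, $V$ can be expressed entirely in terms of sample correlation
coefficients.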
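
As a numerical check of (18) and (23), the sketch below computes $V$ once from $A$ and once from the sample correlation matrix $R$, and then forms $\lambda = V^{N/2}$. The function names, block sizes, and simulated data are assumptions made for illustration; the data matrix here has observations as rows.

```python
import numpy as np

def block_slices(sizes):
    """Index slices for consecutive blocks of the given sizes."""
    edges = np.cumsum([0] + list(sizes))
    return [slice(edges[i], edges[i + 1]) for i in range(len(sizes))]

def independence_criterion(M, sizes):
    """Return |M| / prod_i |M_ii| for a symmetric positive definite M."""
    _, logdet = np.linalg.slogdet(M)
    for s in block_slices(sizes):
        logdet -= np.linalg.slogdet(M[s, s])[1]
    return np.exp(logdet)

rng = np.random.default_rng(3)
sizes = [2, 1, 2]                        # p_1, p_2, p_3 (illustrative values)
N = 30
X = rng.normal(size=(N, sum(sizes)))     # rows are observations
Xc = X - X.mean(axis=0)
A = Xc.T @ Xc                            # A as in (10)
R = np.corrcoef(X, rowvar=False)         # sample correlation matrix
V_from_A = independence_criterion(A, sizes)
V_from_R = independence_criterion(R, sizes)
lam = V_from_A ** (N / 2)                # likelihood ratio criterion (16)
print(V_from_A, V_from_R, lam)
```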

We can interpret the criterion $V$ in terms of generalized variance. Each
set $(x_{i1},\ldots,x_{iN})$ can be considered as a vector in $N$-space; the set $(x_{i1} -
\bar{x}_i,\ldots,x_{iN} - \bar{x}_i) = z_i$, say, is the projection on the plane orthogonal to the
equiangular line. The determinant $|A|$ is the $p$-dimensional volume squared
of the parallelotope with $z_1,\ldots,z_p$ as principal edges. The determinant $|A_{ii}|$
is the $p_i$-dimensional volume squared of the parallelotope having as principal
edges the $i$th set of vectors. If each set of vectors is orthogonal to each other
set (i.e., $R_{ij} = 0$, $i \ne j$), then the volume squared $|A|$ is the product of the
volumes squared $|A_{ii}|$. For example, if $p = 2$, $p_1 = p_2 = 1$, this statement is
that the area of a parallelogram is the product of the lengths of the sides
if the sides are at right angles. If the sets are almost orthogonal, then $|A|$
is almost $\prod|A_{ii}|$, and $V$ is almost 1.
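The criterion has an invariance property. Let $C_i$ be an arbitrary nonsingular
matrix of order $p_i$ and let


(24)    $C = \begin{pmatrix} C_1 & 0 & \cdots & 0 \\ 0 & C_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & C_q \end{pmatrix}.$


Let $Cx_\alpha + d = x_\alpha^*$. Then the criterion for independence in terms of $x_\alpha^*$ is
identical to the criterion in terms of $x_\alpha$. Let $A^* = \sum_\alpha (x_\alpha^* - \bar{x}^*)(x_\alpha^* - \bar{x}^*)'$ be
partitioned into submatrices $A_{ij}^*$. Then


(25)    $A_{ij}^* = \sum_{\alpha}(x_\alpha^{*(i)} - \bar{x}^{*(i)})(x_\alpha^{*(j)} - \bar{x}^{*(j)})'$

        $= C_i\sum_{\alpha}(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(j)} - \bar{x}^{(j)})'C_j'$

        $= C_iA_{ij}C_j'$

and $A^* = CAC'$. Thus


(26)    $V^* = \dfrac{|A^*|}{\prod|A_{ii}^*|} = \dfrac{|CAC'|}{\prod|C_iA_{ii}C_i'|} = \dfrac{|C|\cdot|A|\cdot|C'|}{\prod|C_i|\cdot|A_{ii}|\cdot|C_i'|} = \dfrac{|A|}{\prod|A_{ii}|} = V$


for $|C| = \prod|C_i|$. Thus the test is invariant with respect to linear
transformations within each set.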
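
The invariance in (26) can be illustrated on simulated data. In this sketch the helper `V_stat`, the block sizes, and the particular $C$ and $d$ are assumptions chosen for the example.

```python
import numpy as np

def V_stat(X, sizes):
    """V = |A| / prod |A_ii| from an N x p data matrix X (rows are observations)."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc
    edges = np.cumsum([0] + list(sizes))
    logV = np.linalg.slogdet(A)[1]
    for a, b in zip(edges[:-1], edges[1:]):
        logV -= np.linalg.slogdet(A[a:b, a:b])[1]
    return np.exp(logV)

rng = np.random.default_rng(4)
sizes = [2, 3]
p, N = sum(sizes), 25
X = rng.normal(size=(N, p))

# Block-diagonal C as in (24) and an arbitrary shift d.
C = np.zeros((p, p))
C[:2, :2] = rng.normal(size=(2, 2)) + 2.0 * np.eye(2)
C[2:, 2:] = rng.normal(size=(3, 3)) + 2.0 * np.eye(3)
d = rng.normal(size=p)
X_star = X @ C.T + d                     # x*_alpha = C x_alpha + d, applied row by row

print(V_stat(X, sizes), V_stat(X_star, sizes))
```

A transformation that mixes the two sets (a $C$ with nonzero off-diagonal blocks) would in general change $V$, which is why the invariance is only within sets.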

Narain (1950) showed that the test based on V is strictly unbiased; that is, 
the probability of rejecting the null hypothesis is greater than the significance 
level if the hypothesis is not true. [See also Daly (1940).] 


9.3. THE DISTRIBUTION OF THE LIKELIHOOD RATIO CRITERION
WHEN THE NULL HYPOTHESIS IS TRUE


9.3.1. Characterization of the Distribution


We shall show that under the null hypothesis the distribution of the criterion 
V is the distribution of a product of independent variables, each of which has 
the distribution of a criterion U for the linear hypothesis (Section 8.4). 

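Let


(1)    $V_i = \dfrac{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} & A_{1i} \\ \vdots & & \vdots & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i} \\ A_{i1} & \cdots & A_{i,i-1} & A_{ii} \end{vmatrix}}{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{vmatrix}\cdot|A_{ii}|}, \qquad i = 2,\ldots,q.$

Then $V = V_2V_3\cdots V_q$. Note that $V_i$ is the $N/2$th root of the likelihood ratio
criterion for testing the null hypothesis


(2)    $H_i: \Sigma_{i1} = 0, \ldots, \Sigma_{i,i-1} = 0,$


that is, that $X^{(i)}$ is independent of $(X^{(1)\prime},\ldots,X^{(i-1)\prime})'$. The null hypothesis
$H$ is the intersection of these hypotheses.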
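
The factorization $V = V_2V_3\cdots V_q$ follows because the determinants in (1) telescope. The sketch below checks this numerically; the function `V_factors`, the block sizes, and the simulated matrix $A$ are illustrative assumptions.

```python
import numpy as np

def V_factors(A, sizes):
    """Return [V_2, ..., V_q] as defined in (1), from a p x p matrix A."""
    edges = np.cumsum([0] + list(sizes))
    ld = lambda M: np.linalg.slogdet(M)[1]
    factors = []
    for i in range(1, len(sizes)):
        lo, hi = edges[i], edges[i + 1]
        # |A up to block i| / (|A up to block i-1| * |A_ii|)
        log_Vi = ld(A[:hi, :hi]) - ld(A[:lo, :lo]) - ld(A[lo:hi, lo:hi])
        factors.append(np.exp(log_Vi))
    return factors

rng = np.random.default_rng(5)
sizes = [1, 2, 2]
Xc = rng.normal(size=(40, sum(sizes)))
Xc -= Xc.mean(axis=0)
A = Xc.T @ Xc
edges = np.cumsum([0] + sizes)
V = np.linalg.det(A) / np.prod(
    [np.linalg.det(A[a:b, a:b]) for a, b in zip(edges[:-1], edges[1:])]
)
print(V, np.prod(V_factors(A, sizes)))
```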


Theorem 9.3.1. When $H_i$ is true, $V_i$ has the distribution of $U_{p_i,\bar{p}_i,n-\bar{p}_i}$, where
$n = N - 1$ and $\bar{p}_i = p_1 + \cdots + p_{i-1}$, $i = 2,\ldots,q$.


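Proof. The matrix $A$ has the distribution of $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $Z_1,\ldots,Z_n$
are independently distributed according to $N(0, \Sigma)$ and $Z_\alpha$ is partitioned as
$(Z_\alpha^{(1)\prime},\ldots,Z_\alpha^{(q)\prime})'$. Then conditional on $Z_\alpha^{(1)} = z_\alpha^{(1)},\ldots,Z_\alpha^{(i-1)} = z_\alpha^{(i-1)}$,
$\alpha = 1,\ldots,n$, the subvectors $Z_1^{(i)},\ldots,Z_n^{(i)}$ are independently distributed, $Z_\alpha^{(i)}$
having a normal distribution with mean


(3)    $B_i\begin{pmatrix} z_\alpha^{(1)} \\ \vdots \\ z_\alpha^{(i-1)} \end{pmatrix}$

and covariance matrix

(4)    $\Sigma_{ii} - B_i\begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1,i-1} \\ \vdots & & \vdots \\ \Sigma_{i-1,1} & \cdots & \Sigma_{i-1,i-1} \end{pmatrix}B_i',$

where

(5)    $B_i = (\Sigma_{i1} \cdots \Sigma_{i,i-1})\begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1,i-1} \\ \vdots & & \vdots \\ \Sigma_{i-1,1} & \cdots & \Sigma_{i-1,i-1} \end{pmatrix}^{-1}.$


When the null hypothesis is not assumed, the estimator of $B_i$ is (5) with $\Sigma_{jk}$
replaced by $A_{jk}$, and the estimator of (4) is (4) with $\Sigma_{jk}$ replaced by $(1/n)A_{jk}$
and $B_i$ replaced by its estimator. Under $H_i$, $B_i = 0$ and the covariance matrix
(4) is $\Sigma_{ii}$, which is estimated by $(1/n)A_{ii}$. The $N/2$th root of the likelihood
ratio criterion for $H_i$ is


(6)    $\dfrac{\left|A_{ii} - (A_{i1} \cdots A_{i,i-1})\begin{pmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{pmatrix}^{-1}\begin{pmatrix} A_{1i} \\ \vdots \\ A_{i-1,i} \end{pmatrix}\right|}{|A_{ii}|}$

       $= \dfrac{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} & A_{1i} \\ \vdots & & \vdots & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i} \\ A_{i1} & \cdots & A_{i,i-1} & A_{ii} \end{vmatrix}}{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{vmatrix}\cdot|A_{ii}|},$


which is $V_i$. This is the $U$-statistic for $p_i$ dimensions, $\bar{p}_i$ components of the
conditioning vector, and $n - \bar{p}_i$ degrees of freedom in the estimator of the
covariance matrix. ∎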


Theorem 9.3.2. The distribution of $V$ under the null hypothesis is the
distribution of $V_2V_3\cdots V_q$, where $V_2,\ldots,V_q$ are independently distributed with $V_i$
having the distribution of $U_{p_i,\bar{p}_i,n-\bar{p}_i}$, where $\bar{p}_i = p_1 + \cdots + p_{i-1}$.


Proof. From the proof of Theorem 9.3.1, we see that the distribution of $V_i$
is that of $U_{p_i,\bar{p}_i,n-\bar{p}_i}$ not depending on the conditioning $Z_\alpha^{(k)}$, $k = 1,\ldots,i-1$,
$\alpha = 1,\ldots,n$. Hence the distribution of $V_i$ does not depend on $V_2,\ldots,V_{i-1}$. ∎

Theorem 9.3.3. Under the null hypothesis $V$ is distributed as $\prod_{i=2}^{q}\prod_{j=1}^{p_i} X_{ij}$,
where the $X_{ij}$'s are independent and $X_{ij}$ has the density
$\beta[x \mid \tfrac{1}{2}(n - \bar{p}_i + 1 - j), \tfrac{1}{2}\bar{p}_i]$.


Proof. This theorem follows from Theorems 9.3.2 and 8.4.1. ∎
9.3.2. Moments 


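Theorem 9.3.4. When the null hypothesis is true, the $h$th moment of the
criterion is

(7)    $\mathcal{E}V^h = \prod_{i=2}^{q}\left\{\prod_{j=1}^{p_i}\dfrac{\Gamma[\tfrac{1}{2}(n - \bar{p}_i + 1 - j) + h]\,\Gamma[\tfrac{1}{2}(n + 1 - j)]}{\Gamma[\tfrac{1}{2}(n - \bar{p}_i + 1 - j)]\,\Gamma[\tfrac{1}{2}(n + 1 - j) + h]}\right\}.$

Proof. Because $V_2,\ldots,V_q$ are independent,

(8)    $\mathcal{E}V^h = \mathcal{E}V_2^h\,\mathcal{E}V_3^h \cdots \mathcal{E}V_q^h.$


Theorem 9.3.2 implies $\mathcal{E}V_i^h = \mathcal{E}U^h_{p_i,\bar{p}_i,n-\bar{p}_i}$. Then the theorem follows by
substituting from Theorem 8.4.3. ∎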
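
The moment formula (7) is easy to evaluate with log-gamma functions. The sketch below is a minimal illustration; the function name `moment_V` and the example values are assumptions. For $p_1 = p_2 = 1$ the formula reduces to $\mathcal{E}V = (n-1)/n$, which gives a quick sanity check.

```python
from math import lgamma, exp

def moment_V(h, sizes, n):
    """h-th moment of V under the null hypothesis, from (7); n = N - 1."""
    log_m = 0.0
    p_bar = sizes[0]                      # p_bar_i = p_1 + ... + p_{i-1}
    for p_i in sizes[1:]:
        for j in range(1, p_i + 1):
            a = 0.5 * (n - p_bar + 1 - j)
            b = 0.5 * (n + 1 - j)
            log_m += lgamma(a + h) + lgamma(b) - lgamma(a) - lgamma(b + h)
        p_bar += p_i
    return exp(log_m)

print(moment_V(1, [1, 1], 10))            # equals (n - 1)/n for p_1 = p_2 = 1
```

Working on the log scale keeps the computation stable when $n$ is large and the individual gamma values would overflow.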


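If the $p_i$ are even, say $p_i = 2r_i$, $i > 1$, then by using the duplication formula
$\Gamma(\alpha + \tfrac{1}{2})\Gamma(\alpha + 1) = \sqrt{\pi}\,\Gamma(2\alpha + 1)2^{-2\alpha}$ for the gamma function we can
reduce the $h$th moment of $V$ to


(9)    $\mathcal{E}V^h = \prod_{i=2}^{q}\left\{\prod_{k=1}^{r_i}\dfrac{\Gamma(n + 1 - \bar{p}_i - 2k + 2h)\,\Gamma(n + 1 - 2k)}{\Gamma(n + 1 - \bar{p}_i - 2k)\,\Gamma(n + 1 - 2k + 2h)}\right\}$

       $= \prod_{i=2}^{q}\left\{\prod_{k=1}^{r_i} B^{-1}(n + 1 - \bar{p}_i - 2k, \bar{p}_i)\int_0^1 x^{n - \bar{p}_i - 2k + 2h}(1 - x)^{\bar{p}_i - 1}\,dx\right\}.$


Thus $V$ is distributed as $\prod_{i=2}^{q}\{\prod_{k=1}^{r_i} Y_{ik}^2\}$, where the $Y_{ik}$ are independent and
$Y_{ik}$ has density $\beta(y; n + 1 - \bar{p}_i - 2k, \bar{p}_i)$.
In general, the duplication formula for the gamma function can be used to
reduce the moments as indicated in Section 8.4.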


9.3.3. Some Special Distributions 


If $q = 2$, then $V$ is distributed as $U_{p_2,p_1,n-p_1}$. Special cases have been treated
in Section 8.4, and references to the literature given. The distribution for
$p_1 = p_2 = p_3 = 1$ is given in Problem 9.2, and for $p_1 = p_2 = p_3 = 2$ in Problem
9.3. Wilks (1935) gave the distributions for $p_1 = p_2 = 1$, for $p_3 = p - 2$,† for
$p_1 = 1$, $p_2 = p_3 = 2$, for $p_1 = 1$, $p_2 = 2$, $p_3 = 3$, for $p_1 = 1$, $p_2 = 2$, $p_3 = 4$, and
for $p_1 = p_2 = 2$, $p_3 = 3$. Consul (1967a) treated the case $p_1 = 2$, $p_2 = 3$, $p_3$
even.

Wald and Brookner (1941) gave a method for deriving the distribution if 
not more than one p, is odd. It can be seen that the same result can be 
obtained by integration of products of beta functions after using the duplica- 
tion formula to reduce the moments. 

Mathai and Saxena (1973) gave the exact distribution for the general case.
Mathai and Katiyar (1979) gave exact significance points for $p = 3(1)10$ and
$n = 3(1)20$ for significance levels of 5% and 1% (of $-k\log V$ of Section 9.4).


†In Wilks's formula $\Gamma[\tfrac{1}{2}(N - 2 - i)]$ should be $\Gamma(N - 2 - i)$.

9.4. AN ASYMPTOTIC EXPANSION OF THE DISTRIBUTION OF THE
LIKELIHOOD RATIO CRITERION


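The $h$th moment of $\lambda = V^{\frac{1}{2}N}$ is


(1)    $\mathcal{E}\lambda^h = K\,\dfrac{\prod_{i=1}^{p}\Gamma\{\tfrac{1}{2}[N(1 + h) - i]\}}{\prod_{i=1}^{q}\left\{\prod_{j=1}^{p_i}\Gamma\{\tfrac{1}{2}[N(1 + h) - j]\}\right\}},$

where $K$ is chosen so that $\mathcal{E}\lambda^0 = 1$. This is of the form of (1) of Section 8.5
with


(2)    $a = p, \quad b = p, \quad x_k = \tfrac{1}{2}N, \quad \xi_k = \tfrac{1}{2}(-k), \quad k = 1,\ldots,p,$

       $y_j = \tfrac{1}{2}N, \quad \eta_j = \tfrac{1}{2}(-j + p_1 + \cdots + p_{i-1}), \quad j = p_1 + \cdots + p_{i-1} + 1,\ldots,p_1 + \cdots + p_i, \quad i = 1,\ldots,q.$


Then $f = \tfrac{1}{2}[p(p + 1) - \sum p_i(p_i + 1)] = \tfrac{1}{2}(p^2 - \sum p_i^2)$, $\beta_k = \varepsilon_j = \tfrac{1}{2}(1 - \rho)N$.
In order to make the second term in the expansion vanish we take $\rho$ as


(3)    $\rho = 1 - \dfrac{2(p^3 - \sum p_i^3) + 9(p^2 - \sum p_i^2)}{6N(p^2 - \sum p_i^2)}.$

Let

(4)    $k = \rho N = N - \dfrac{3}{2} - \dfrac{p^3 - \sum p_i^3}{3(p^2 - \sum p_i^2)}.$


Then $\omega_2 = \gamma_2/k^2$, where [as shown by Box (1949)]


(5)    $\gamma_2 = \dfrac{p^4 - \sum p_i^4}{48} - \dfrac{5(p^2 - \sum p_i^2)}{96} - \dfrac{(p^3 - \sum p_i^3)^2}{72(p^2 - \sum p_i^2)}.$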

We obtain from Section 8.5 the following expansion:


(6)    $\Pr\{-k\log V \le v\} = \Pr\{\chi_f^2 \le v\} + \dfrac{\gamma_2}{k^2}\left[\Pr\{\chi_{f+4}^2 \le v\} - \Pr\{\chi_f^2 \le v\}\right] + O(k^{-3}).$

Table 9.1

p    f     v         γ_2       N     k       γ_2/k²    Second Term
4    6     12.592    11/24     15    71/6    0.0033    -0.0007
5    10    18.307    15/8      15    23/2    0.0142    -0.0021
6    15    24.996    235/48    15    67/6    0.0393    -0.0043
                               16    73/6    0.0331    -0.0036

If q = 2, we obtain further terms in the expansion by using the results of 
Section 8.5. 


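If $p_i = 1$, we have


(7)    $f = \tfrac{1}{2}p(p - 1),$

       $k = N - \dfrac{2p + 11}{6},$

       $\gamma_2 = \dfrac{p(p - 1)}{288}(2p^2 - 2p - 13),$

       $\gamma_3 = \dfrac{p(p - 1)}{3240}(p - 2)(2p - 1)(p + 1);$


other terms are given by Box (1949). If $p_i = 2$ ($p = 2q$),


(8)    $f = 2q(q - 1),$

       $k = N - \dfrac{4q + 13}{6},$

       $\gamma_2 = \dfrac{q(q - 1)}{72}(8q^2 - 8q - 7).$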

Table 9.1 gives an indication of the order of approximation of (6) for
$p_i = 1$. In each case $v$ is chosen so that the first term is 0.95.


If $q = 2$, the approximate distributions given in Sections 8.5.3 and 8.5.4 are
available. [See also Nagao (1973c).]
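
The constants $f$, $k$, and $\gamma_2$ of (3) to (5) can be evaluated exactly with rational arithmetic, which gives a way to check the entries of Table 9.1 and the special cases (7) and (8). The function name `box_constants` is an assumption made for this sketch.

```python
from fractions import Fraction

def box_constants(sizes, N):
    """Return f, k, gamma_2 from (2)-(5) as exact fractions."""
    p = sum(sizes)
    s2 = p**2 - sum(x**2 for x in sizes)
    s3 = p**3 - sum(x**3 for x in sizes)
    s4 = p**4 - sum(x**4 for x in sizes)
    f = Fraction(s2, 2)
    k = N - Fraction(3, 2) - Fraction(s3, 3 * s2)
    gamma2 = Fraction(s4, 48) - Fraction(5 * s2, 96) - Fraction(s3**2, 72 * s2)
    return f, k, gamma2

# The four (p, N) combinations of Table 9.1, all with p_i = 1.
for p, N in [(4, 15), (5, 15), (6, 15), (6, 16)]:
    f, k, g2 = box_constants([1] * p, N)
    print(p, N, f, k, g2, round(float(g2 / k**2), 4))
```

The "Second Term" column of the table additionally needs $\chi^2$ probabilities, which are not computed here.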


9.5. OTHER CRITERIA


In case $q = 2$, the criteria considered in Section 8.6 can be used with $G + H$
replaced by $A_{11}$ and $H$ replaced by $A_{12}A_{22}^{-1}A_{21}$, or $G + H$ replaced by $A_{22}$
and $H$ replaced by $A_{21}A_{11}^{-1}A_{12}$.

The null hypothesis of independence is that $\Sigma - \Sigma_0 = 0$, where $\Sigma_0$ is
defined in (6) of Section 9.2. An appropriate test procedure will reject the
null hypothesis if the elements of $A - A_0$ are large compared to the elements
of the diagonal blocks of $A_0$ (where $A_0$ is composed of diagonal blocks $A_{ii}$
and off-diagonal blocks of 0). Let the nonsingular matrix $B_{ii}$ be such that
$B_{ii}A_{ii}B_{ii}' = I$, that is, $A_{ii}^{-1} = B_{ii}'B_{ii}$, and let $B_0$ be the matrix with $B_{ii}$ as the
$i$th diagonal block and 0's as off-diagonal blocks. Then $B_0A_0B_0' = I$ and


(1)    $B_0(A - A_0)B_0' = \begin{pmatrix} 0 & B_{11}A_{12}B_{22}' & \cdots & B_{11}A_{1q}B_{qq}' \\ B_{22}A_{21}B_{11}' & 0 & \cdots & B_{22}A_{2q}B_{qq}' \\ \vdots & \vdots & & \vdots \\ B_{qq}A_{q1}B_{11}' & B_{qq}A_{q2}B_{22}' & \cdots & 0 \end{pmatrix}.$


This matrix is invariant with respect to transformations (24) of Section 9.2
operating on $A$. A different choice of $B_{ii}$ amounts to multiplying (1) on the
left by $Q_0$ and on the right by $Q_0'$, where $Q_0$ is a matrix with orthogonal
diagonal blocks and off-diagonal blocks of 0's. A test procedure should reject
the null hypothesis if some measure of the numerical values of the elements
of (1) is too large. The likelihood ratio criterion is the $N/2$ power of
$|B_0(A - A_0)B_0' + I| = |B_0AB_0'|$.

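Another measure, suggested by Nagao (1973a), is

(2)    $\tfrac{1}{2}\operatorname{tr}[B_0(A - A_0)B_0']^2 = \tfrac{1}{2}\operatorname{tr}[(A - A_0)A_0^{-1}]^2 = \tfrac{1}{2}\operatorname{tr}(AA_0^{-1} - I)^2$

       $= \tfrac{1}{2}\sum_{\substack{i,j=1 \\ i \ne j}}^{q}\operatorname{tr} A_{ii}^{-1}A_{ij}A_{jj}^{-1}A_{ji}.$

For $q = 2$ this measure is the average of the Bartlett-Nanda-Pillai trace
criterion with $G + H$ replaced by $A_{11}$ and $H$ replaced by $A_{12}A_{22}^{-1}A_{21}$ and the
same criterion with $G + H$ replaced by $A_{22}$ and $H$ replaced by $A_{21}A_{11}^{-1}A_{12}$.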
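
The two expressions for Nagao's measure in (2) can be compared numerically. In the sketch below the function name `nagao_measure`, the block sizes, and the simulated matrix $A$ are assumptions for illustration only.

```python
import numpy as np

def nagao_measure(A, sizes):
    """Return (1/2) tr (A A_0^{-1} - I)^2 and the equivalent block-sum form of (2)."""
    edges = np.cumsum([0] + list(sizes))
    blocks = [slice(a, b) for a, b in zip(edges[:-1], edges[1:])]
    A0 = np.zeros_like(A)
    for s in blocks:
        A0[s, s] = A[s, s]                       # keep the diagonal blocks only
    M = A @ np.linalg.inv(A0) - np.eye(A.shape[0])
    direct = 0.5 * np.trace(M @ M)
    block_sum = 0.0
    for i, si in enumerate(blocks):
        for j, sj in enumerate(blocks):
            if i != j:
                block_sum += np.trace(
                    np.linalg.solve(A[si, si], A[si, sj])
                    @ np.linalg.solve(A[sj, sj], A[sj, si])
                )
    return direct, 0.5 * block_sum

rng = np.random.default_rng(6)
sizes = [2, 2, 1]
Xc = rng.normal(size=(30, sum(sizes)))
Xc -= Xc.mean(axis=0)
A = Xc.T @ Xc
direct, by_blocks = nagao_measure(A, sizes)
print(direct, by_blocks)
```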

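This criterion multiplied by $n$ or $N$ has a limiting $\chi^2$-distribution with
number of degrees of freedom $f = \tfrac{1}{2}(p^2 - \sum_{i=1}^{q} p_i^2)$, which is the same
number as for $-N\log V$. Nagao obtained an asymptotic expansion of the
distribution:


(3)    $\Pr\{\tfrac{1}{2}n\operatorname{tr}(AA_0^{-1} - I)^2 \le x\}$

       $= \Pr\{\chi_f^2 \le x\}$

       $+ \dfrac{1}{n}\left[\dfrac{1}{12}\left(p^3 - 3p\sum_{i=1}^{q}p_i^2 + 2\sum_{i=1}^{q}p_i^3\right)\Pr\{\chi_{f+6}^2 \le x\}\right.$

       $+ \dfrac{1}{8}\left(-2p^3 + 4p\sum_{i=1}^{q}p_i^2 - 2\sum_{i=1}^{q}p_i^3 - p^2 + \sum_{i=1}^{q}p_i^2\right)\Pr\{\chi_{f+4}^2 \le x\}$

       $+ \dfrac{1}{4}\left(p^3 - p\sum_{i=1}^{q}p_i^2 + p^2 - \sum_{i=1}^{q}p_i^2\right)\Pr\{\chi_{f+2}^2 \le x\}$

       $\left.- \dfrac{1}{24}\left(2p^3 - 2\sum_{i=1}^{q}p_i^3 + 3p^2 - 3\sum_{i=1}^{q}p_i^2\right)\Pr\{\chi_f^2 \le x\}\right] + O(n^{-2}).$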

i=l 


9.6. STEP-DOWN PROCEDURES 


9.6.1. Step-down by Blocks 


It was shown in Section 9.3 that the $N/2$th root of the likelihood ratio
criterion, namely $V$, is the product of $q - 1$ of these criteria, that is,
$V_2,\ldots,V_q$. The $i$th subcriterion $V_i$ provides a likelihood ratio test of the
hypothesis $H_i$ [(2) of Section 9.3] that the $i$th subvector is independent of the
preceding $i - 1$ subvectors. Under the null hypothesis $H$ [$= \bigcap_{i=2}^{q} H_i$] these
$q - 1$ criteria are independent (Theorem 9.3.2). A step-down testing procedure
is to accept the null hypothesis if


(1)    $V_i \ge v_i(\varepsilon_i), \qquad i = 2,\ldots,q,$


and reject the null hypothesis if $V_i < v_i(\varepsilon_i)$ for any $i$. Here $v_i(\varepsilon_i)$ is the
number such that the probability of (1) when $H_i$ is true is $1 - \varepsilon_i$. The
significance level of the procedure is $\varepsilon$ satisfying


(2)    $1 - \varepsilon = \prod_{i=2}^{q}(1 - \varepsilon_i).$

The subtests can be done sequentially, say, in the order 2,..., q. As soon as а 
subtest calls for rejection, the procedure is terminated; if no subtest leads to 
rejection, H is accepted. The ordering of the subvectors is at the discretion 
of the investigator as well as the ordering of the tests. 
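
The sequential procedure and the level formula (2) can be sketched in a few lines. The function names and the numerical critical values below are hypothetical, chosen only to show the control flow; actual critical values $v_i(\varepsilon_i)$ would come from the distribution of $U_{p_i,\bar{p}_i,n-\bar{p}_i}$.

```python
from math import prod

def overall_level(eps):
    """Overall significance level from (2): 1 - prod(1 - eps_i)."""
    return 1.0 - prod(1.0 - e for e in eps)

def step_down(V_values, critical_values):
    """Test H_2, ..., H_q in order and stop at the first rejection.

    Returns the index i (starting at 2) of the first rejected H_i,
    or None if no subtest rejects and H is accepted.
    """
    for i, (V_i, v_i) in enumerate(zip(V_values, critical_values), start=2):
        if V_i < v_i:
            return i
    return None

print(overall_level([0.01, 0.02, 0.02]))
print(step_down([0.9, 0.4, 0.8], [0.5, 0.5, 0.5]))   # hypothetical values
```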

Suppose, for example, that measurements on an individual are grouped 
into physiological measurements, measurements of intelligence, and mea- 
surements of emotional characteristics. One could test that intelligence is 
independent of physiology and then that emotions are independent of 
physiology and intelligence, or the order of these could be reversed. Alterna- 
tively, one could test that intelligence is independent of emotions and then 
that physiology is independent of these two aspects, or the order reversed. 
There is a third pair of procedures.

Other criteria for the linear hypothesis discussed in Section 8.6 can be
used to test the component hypotheses $H_2, \dots, H_q$ in a similar fashion.
When $H_i$ is true, the criterion is distributed independently of $X_\alpha^{(1)}, \dots, X_\alpha^{(i-1)}$,
$\alpha = 1, \dots, N$, and hence independently of the criteria for $H_2, \dots, H_{i-1}$.


9.6.2. Step-down by Components 


In Section 8.4.5 we discussed a componentwise step-down procedure for
testing that a submatrix of regression coefficients was a specified matrix. We
adapt this procedure to test the null hypothesis $H_i$ cast in the form


(3) $$H_i:\ (\Sigma_{i1}\ \Sigma_{i2}\ \cdots\ \Sigma_{i,i-1})
\begin{pmatrix}
\Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1,i-1} \\
\Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2,i-1} \\
\vdots & \vdots & & \vdots \\
\Sigma_{i-1,1} & \Sigma_{i-1,2} & \cdots & \Sigma_{i-1,i-1}
\end{pmatrix}^{-1} = 0,$$


where 0 is of order $p_i \times \bar{p}_i$. The matrix in (3) consists of the coefficients of the
regression of $X^{(i)}$ on $(X^{(1)\prime}, \dots, X^{(i-1)\prime})'$.

For i = 2, we test in sequence whether the regression of $X_{p_1+1}$ on $X^{(1)} = (X_1, \dots, X_{p_1})'$ is 0, whether the regression of $X_{p_1+2}$ on $X^{(1)}$ is 0 in the
regression of $X_{p_1+2}$ on $X^{(1)}$ and $X_{p_1+1}$, ..., and whether the regression of
$X_{p_1+p_2}$ on $X^{(1)}$ is 0 in the regression of $X_{p_1+p_2}$ on $X^{(1)}, X_{p_1+1}, \dots, X_{p_1+p_2-1}$.
These hypotheses are equivalently that the first, second, ..., and $p_2$th rows
of the matrix in (3) for i = 2 are 0-vectors.

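Let $A_{ii}^{(k)}$ be the $k \times k$ matrix in the upper left-hand corner of $A_{ii}$, let $A_{ij}^{(k)}$
consist of the upper k rows of $A_{ij}$, and let $A_{ji}^{(k)}$ consist of the first k columns
of $A_{ji}$, $j = 1, \dots, i-1$. Then the criterion for testing that the first row of (3) is 0
is

(4)
$$
X_{i1} = \frac{A_{ii}^{(1)} - (A_{i1}^{(1)} \cdots A_{i,i-1}^{(1)})
\begin{pmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{pmatrix}^{-1}
\begin{pmatrix} A_{1i}^{(1)} \\ \vdots \\ A_{i-1,i}^{(1)} \end{pmatrix}}{A_{ii}^{(1)}}
= \frac{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} & A_{1i}^{(1)} \\ \vdots & & \vdots & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i}^{(1)} \\ A_{i1}^{(1)} & \cdots & A_{i,i-1}^{(1)} & A_{ii}^{(1)} \end{vmatrix}}
{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{vmatrix} \cdot A_{ii}^{(1)}}.
$$

For k > 1, the criterion for testing that the kth row of the matrix in (3) is 0 is
[see (8) in Section 8.4]

(5)
$$
\begin{aligned}
X_{ik} &= \frac{\left| A_{ii}^{(k)} - (A_{i1}^{(k)} \cdots A_{i,i-1}^{(k)})
\begin{pmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{pmatrix}^{-1}
\begin{pmatrix} A_{1i}^{(k)} \\ \vdots \\ A_{i-1,i}^{(k)} \end{pmatrix} \right|}
{\left| A_{ii}^{(k-1)} - (A_{i1}^{(k-1)} \cdots A_{i,i-1}^{(k-1)})
\begin{pmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{pmatrix}^{-1}
\begin{pmatrix} A_{1i}^{(k-1)} \\ \vdots \\ A_{i-1,i}^{(k-1)} \end{pmatrix} \right|}
\cdot \frac{|A_{ii}^{(k-1)}|}{|A_{ii}^{(k)}|} \\[6pt]
&= \frac{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} & A_{1i}^{(k)} \\ \vdots & & \vdots & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i}^{(k)} \\ A_{i1}^{(k)} & \cdots & A_{i,i-1}^{(k)} & A_{ii}^{(k)} \end{vmatrix}}
{\begin{vmatrix} A_{11} & \cdots & A_{1,i-1} & A_{1i}^{(k-1)} \\ \vdots & & \vdots & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i}^{(k-1)} \\ A_{i1}^{(k-1)} & \cdots & A_{i,i-1}^{(k-1)} & A_{ii}^{(k-1)} \end{vmatrix}}
\cdot \frac{|A_{ii}^{(k-1)}|}{|A_{ii}^{(k)}|},
\qquad k = 2, \dots, p_i,\quad i = 2, \dots, q.
\end{aligned}
$$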


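Under the null hypothesis the criterion $X_{ij}$ has the beta density $\beta[x; \frac{1}{2}(n - \bar{p}_i + 1 - j), \frac{1}{2}\bar{p}_i]$. For given i, the criteria $X_{i1}, \dots, X_{ip_i}$ are independent (Theorem
8.4.1). The sets for different i are independent by the argument in Section
9.6.1.

A step-down procedure consists of a sequence of tests based on
$X_{21}, \dots, X_{2p_2}, X_{31}, \dots, X_{qp_q}$. A particular component test leads to rejection if


(6) $$\frac{1 - X_{ij}}{X_{ij}} \cdot \frac{n - \bar{p}_i + 1 - j}{\bar{p}_i} > F_{\bar{p}_i,\, n - \bar{p}_i + 1 - j}(\varepsilon_{ij}).$$

The significance level is $\varepsilon$, where


(7) $$1 - \varepsilon = \prod_{i=2}^{q} \prod_{j=1}^{p_i} (1 - \varepsilon_{ij}).$$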




The sequence of subvectors and the sequence of components within each
subvector are at the discretion of the investigator.


The criterion $V_i$ for testing $H_i$ is $V_i = \prod_{j=1}^{p_i} X_{ij}$, and the criterion for the null
hypothesis H is


(8) $$V = \prod_{i=2}^{q} V_i = \prod_{i=2}^{q} \prod_{j=1}^{p_i} X_{ij}.$$


These are the random variables described in Theorem 9.3.3.
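
The factorization (8) can be checked numerically from the determinant form of (4) and (5). The sketch below is an illustration only: the matrix `A`, the helper names `det`, `sub`, and `stepdown_components`, and the block sizes are hypothetical and are not taken from the text.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting; det([]) is 1."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def sub(a, idx):
    """Principal submatrix of a on the index list idx."""
    return [[a[r][c] for c in idx] for r in idx]

def stepdown_components(a, sizes):
    """Return {(i, k): X_ik} for i = 2..q and k = 1..p_i (1-based labels)."""
    out = {}
    start = sizes[0]
    for i in range(1, len(sizes)):
        prev = list(range(start))              # preceding i - 1 subvectors
        for k in range(1, sizes[i] + 1):
            cur = list(range(start, start + k))   # first k components of set i
            last = cur[:-1]                        # first k - 1 components
            num = det(sub(a, prev + cur)) / det(sub(a, prev + last))
            den = det(sub(a, cur)) / det(sub(a, last))
            out[(i + 1, k)] = num / den
        start += sizes[i]
    return out

# Hypothetical positive definite matrix A with p_1 = p_2 = 2.
A = [[4.0, 1.0, 1.0, 0.5],
     [1.0, 3.0, 0.5, 1.0],
     [1.0, 0.5, 5.0, 1.0],
     [0.5, 1.0, 1.0, 3.0]]
X = stepdown_components(A, [2, 2])
V2 = det(A) / (det(sub(A, [0, 1])) * det(sub(A, [2, 3])))
print(X, V2)
```

For this partition the product of the two component criteria reproduces $V_2 = |A| / (|A_{11}| \cdot |A_{22}|)$ up to rounding, which is the telescoping behind (8).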


9.7. AN EXAMPLE 


We take the following example from an industrial time study [Abruzzi
(1950)]. The purpose of the study was to investigate the length of time taken
by various operators in a garment factory to do several elements of a pressing
operation. The entire pressing operation was divided into the following six
elements:


1. Pick up and position garment.
2. Press and repress short dart.
3. Reposition garment on ironing board.
4. Press three-quarters of length of long dart.
5. Press balance of long dart.
6. Hang garment on rack.


In this case $x_\alpha$ is the vector of measurements on individual $\alpha$. The component $x_{i\alpha}$ is the time taken to do the ith element of the operation. N is 76.


The data (in seconds) are summarized in the sample mean vector and
covariance matrix:


(1) $$\bar{x} = \begin{pmatrix} 9.47 \\ 25.56 \\ 13.25 \\ 31.44 \\ 27.29 \\ 8.80 \end{pmatrix},$$


(2) $$S = \begin{pmatrix}
2.57 & 0.85 & 1.56 & 1.79 & 1.33 & 0.42 \\
0.85 & 37.00 & 3.34 & 13.47 & 7.59 & 0.52 \\
1.56 & 3.34 & 8.44 & 5.77 & 2.00 & 0.50 \\
1.79 & 13.47 & 5.77 & 34.01 & 10.50 & 1.77 \\
1.33 & 7.59 & 2.00 & 10.50 & 23.01 & 3.43 \\
0.42 & 0.52 & 0.50 & 1.77 & 3.43 & 4.59
\end{pmatrix}.$$

The sample standard deviations are (1.604, 6.041, 2.903, 5.832, 4.798, 2.141).
The sample correlation matrix is


(3) $$R = \begin{pmatrix}
1.000 & 0.088 & 0.334 & 0.191 & 0.173 & 0.123 \\
0.088 & 1.000 & 0.186 & 0.384 & 0.262 & 0.040 \\
0.334 & 0.186 & 1.000 & 0.343 & 0.144 & 0.080 \\
0.191 & 0.384 & 0.343 & 1.000 & 0.375 & 0.142 \\
0.173 & 0.262 & 0.144 & 0.375 & 1.000 & 0.334 \\
0.123 & 0.040 & 0.080 & 0.142 & 0.334 & 1.000
\end{pmatrix}.$$


The investigators are interested in testing the hypothesis that the six
variates are mutually independent. It often happens in time studies that a
new operation is proposed in which the elements are combined in a different
way; the new operation may use some of the elements several times and some
elements may be omitted. If the times for the different elements in the
operation for which data are available are independent, it may reasonably be
assumed that they will be independent in a new operation. Then the
distribution of time for the new operation can be estimated by using the
means and variances of the individual items.

In this problem the criterion V is V = |R| = 0.472. Since the sample size is
large we can use asymptotic theory: k = 433/6, f = 15, and $-k \log V = 54.1$.
Since the significance point for the $\chi^2$-distribution with 15 degrees of
freedom is 30.6 at the 0.01 significance level, we find the result significant.
We reject the hypothesis of independence; we cannot consider the times of
the elements independent.
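
The arithmetic of this example can be retraced with a short script. This is a sketch under stated assumptions: the correction factor `k` is computed from the formula $k = N - \frac{3}{2} - (p^3 - \sum p_i^3)/[3(p^2 - \sum p_i^2)]$, which is taken here as the Section 9.4 definition and is not restated in this section; the helper `det` is a hypothetical name. Because the correlations are rounded to three decimals, the computed values should agree with the quoted ones only approximately.

```python
import math

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

# Sample correlation matrix of the six pressing elements.
R = [[1.000, 0.088, 0.334, 0.191, 0.173, 0.123],
     [0.088, 1.000, 0.186, 0.384, 0.262, 0.040],
     [0.334, 0.186, 1.000, 0.343, 0.144, 0.080],
     [0.191, 0.384, 0.343, 1.000, 0.375, 0.142],
     [0.173, 0.262, 0.144, 0.375, 1.000, 0.334],
     [0.123, 0.040, 0.080, 0.142, 0.334, 1.000]]

N, p = 76, 6
sizes = [1] * 6                      # six sets of one variate each
s2 = sum(s ** 2 for s in sizes)
s3 = sum(s ** 3 for s in sizes)

V = det(R)                           # criterion V = |R|
f = (p ** 2 - s2) // 2               # degrees of freedom
k = N - 1.5 - (p ** 3 - s3) / (3 * (p ** 2 - s2))   # assumed Section 9.4 factor
stat = -k * math.log(V)              # compared with the chi-square point 30.6
print(V, f, k, stat)
```

The statistic is compared with the tabled 1% point 30.6 quoted in the text rather than with a computed quantile, so no statistical library is needed.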


9.8. THE CASE OF TWO SETS OF VARIATES 


In the case of two sets of variates (q = 2), the random vector X, the
observation vector $x_\alpha$, the mean vector $\mu$, and the covariance matrix $\Sigma$ are
partitioned as follows:


(1) $$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}, \qquad
x_\alpha = \begin{pmatrix} x_\alpha^{(1)} \\ x_\alpha^{(2)} \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

The null hypothesis of independence specifies that $\Sigma_{12} = 0$, that is, that $\Sigma$ is
of the form


(2) $$\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.$$

The test criterion is


(3) $$V = \frac{|A|}{|A_{11}| \cdot |A_{22}|}.$$

It was shown in Section 9.3 that when the null hypothesis is true, this
criterion is distributed as $U_{p_1, p_2, N-1-p_2}$, the criterion for testing a hypothesis
about regression coefficients (Chapter 8). We now wish to study further the
relationship between testing the hypothesis of independence of two sets and
testing the hypothesis that regression of one set on the other is zero.

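The conditional distribution of $X_\alpha^{(1)}$ given $X_\alpha^{(2)} = x_\alpha^{(2)}$ is $N[\mu^{(1)} + B(x_\alpha^{(2)} - \mu^{(2)}), \Sigma_{11\cdot 2}] = N[B(x_\alpha^{(2)} - \bar{x}^{(2)}) + \nu, \Sigma_{11\cdot 2}]$, where $B = \Sigma_{12}\Sigma_{22}^{-1}$, $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, and $\nu = \mu^{(1)} + B(\bar{x}^{(2)} - \mu^{(2)})$. Let $X_\alpha^* = X_\alpha^{(1)}$, $z_\alpha^{*\prime} = [(x_\alpha^{(2)} - \bar{x}^{(2)})' \ \ 1]$, $B^* = (B \ \ \nu)$, and $\Sigma^* = \Sigma_{11\cdot 2}$. Then the conditional distribution of $X_\alpha^*$
is $N(B^* z_\alpha^*, \Sigma^*)$. This is exactly the distribution studied in Chapter 8.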

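The null hypothesis that $\Sigma_{12} = 0$ is equivalent to the null hypothesis B = 0.
Considering $x_\alpha^{(2)}$ fixed, we know from Chapter 8 that the criterion (based on
the likelihood ratio criterion) for testing this hypothesis is

(4) $$U = \frac{\left|\sum_\alpha (x_\alpha^* - \hat{B}_\Omega^* z_\alpha^*)(x_\alpha^* - \hat{B}_\Omega^* z_\alpha^*)'\right|}
{\left|\sum_\alpha (x_\alpha^* - \hat{B}_{2\omega}^* z_\alpha^{*(2)})(x_\alpha^* - \hat{B}_{2\omega}^* z_\alpha^{*(2)})'\right|},$$

where $z_\alpha^{*(1)} = x_\alpha^{(2)} - \bar{x}^{(2)}$ and $z_\alpha^{*(2)} = 1$ are the two parts of $z_\alpha^*$, and

(5) $$\begin{aligned}
\hat{B}_{2\omega}^* &= \bar{x}^{(1)}, \\
\hat{B}_\Omega^* &= (\hat{B}_{1\Omega} \ \ \hat{B}_{2\Omega})
= \Bigl(\sum_\alpha x_\alpha^* z_\alpha^{*(1)\prime} \ \ \sum_\alpha x_\alpha^* z_\alpha^{*(2)\prime}\Bigr)
\begin{pmatrix} \sum_\alpha z_\alpha^{*(1)} z_\alpha^{*(1)\prime} & \sum_\alpha z_\alpha^{*(1)} z_\alpha^{*(2)\prime} \\ \sum_\alpha z_\alpha^{*(2)} z_\alpha^{*(1)\prime} & \sum_\alpha z_\alpha^{*(2)} z_\alpha^{*(2)\prime} \end{pmatrix}^{-1} \\
&= (A_{12} \ \ N\bar{x}^{(1)}) \begin{pmatrix} A_{22} & 0 \\ 0 & N \end{pmatrix}^{-1}
= (A_{12}A_{22}^{-1} \ \ \bar{x}^{(1)}).
\end{aligned}$$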


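The matrix in the denominator of U is


(6) $$\sum_{\alpha=1}^{N} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' = A_{11}.$$

The matrix in the numerator is


(7) $$\sum_{\alpha=1}^{N} \bigl[x_\alpha^{(1)} - \bar{x}^{(1)} - A_{12}A_{22}^{-1}(x_\alpha^{(2)} - \bar{x}^{(2)})\bigr]\bigl[x_\alpha^{(1)} - \bar{x}^{(1)} - A_{12}A_{22}^{-1}(x_\alpha^{(2)} - \bar{x}^{(2)})\bigr]' = A_{11} - A_{12}A_{22}^{-1}A_{21}.$$

Therefore,

(8) $$U = \frac{|A_{11} - A_{12}A_{22}^{-1}A_{21}|}{|A_{11}|} = \frac{|A|}{|A_{11}| \cdot |A_{22}|},$$


which is exactly V.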
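
The identity in (8) can be confirmed on a small case. The sketch below uses a hypothetical positive definite matrix `A` with $p_1 = 2$ and $p_2 = 1$, so that $A_{22}$ is a scalar and no matrix inversion routine is required; the helper names `det2` and `det3` are illustrative only.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical sums-of-squares matrix, partitioned with p_1 = 2, p_2 = 1.
A = [[4.0, 1.0, 1.5],
     [1.0, 3.0, 0.5],
     [1.5, 0.5, 2.0]]
A11 = [row[:2] for row in A[:2]]
A12 = [A[0][2], A[1][2]]          # column vector A_12 (equal to A_21')
a22 = A[2][2]                     # scalar A_22

# Numerator matrix of (7): A_11 - A_12 A_22^{-1} A_21.
resid = [[A11[r][c] - A12[r] * A12[c] / a22 for c in range(2)] for r in range(2)]

U = det2(resid) / det2(A11)                 # regression form of the criterion
V = det3(A) / (det2(A11) * a22)             # independence form, (3)
print(U, V)
```

Both quantities are the same ratio of determinants, so they agree up to floating-point rounding.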

Now let us see why it is that when the null hypothesis is true the
distribution of U = V does not depend on whether the $X_\alpha^{(2)}$ are held fixed. It
was shown in Chapter 8 that when the null hypothesis is true the distribution
of U depends only on p, $q_1$, and $N - q_2$, not on $z_\alpha$. Thus the conditional
distribution of V given $X_\alpha^{(2)} = x_\alpha^{(2)}$ does not depend on $x_\alpha^{(2)}$; the joint distribution
of V and $X_\alpha^{(2)}$ is the product of the distribution of V and the distribution
of $X_\alpha^{(2)}$, and the marginal distribution of V is this conditional distribution.
This shows that the distribution of V (under the null hypothesis) does not
depend on whether the $X_\alpha^{(2)}$ are fixed or have any distribution (normal or
not).

We can extend this result to show that if q > 2, the distribution of V
under the null hypothesis of independence does not depend on the distribution
of one set of variates, say $X_\alpha^{(1)}$. We have $V = V_2 \cdots V_q$, where $V_i$ is
defined in (1) of Section 9.3. When the null hypothesis is true, $V_q$ is
distributed independently of $X_\alpha^{(1)}, \dots, X_\alpha^{(q-1)}$ by the previous result. In turn
we argue that $V_j$ is distributed independently of $X_\alpha^{(1)}, \dots, X_\alpha^{(j-1)}$. Thus
$V_2 \cdots V_q$ is distributed independently of $X_\alpha^{(1)}$.


Theorem 9.8.1. Under the null hypothesis of independence, the distribution 
of V is that given earlier in this chapter if q — 1 sets are jointly normally 
distributed, even though one set is not normally distributed. 


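In the case of two sets of variates, we may be interested in a measure of
association between the two sets which is a generalization of the correlation
coefficient. The square of the correlation between two scalars $X_1$ and $X_2$
can be considered as the ratio of the variance of the regression of $X_1$ on $X_2$
to the variance of $X_1$; this is $\mathcal{V}(\beta X_2)/\mathcal{V}(X_1) = \beta^2\sigma_2^2/\sigma_1^2 = (\sigma_{12}^2/\sigma_2^2)/\sigma_1^2 = \rho_{12}^2$. A corresponding measure for vectors $X^{(1)}$ and $X^{(2)}$ is the ratio of the
generalized variance of the regression of $X^{(1)}$ on $X^{(2)}$ to the generalized
variance of $X^{(1)}$, namely,


(9) $$\frac{|\mathcal{E} BX^{(2)}(BX^{(2)})'|}{|\mathcal{E} X^{(1)}X^{(1)\prime}|}
= \frac{|B\Sigma_{22}B'|}{|\Sigma_{11}|}
= \frac{|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|}{|\Sigma_{11}|}
= (-1)^{p_1} \frac{\begin{vmatrix} 0 & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{vmatrix}}{|\Sigma_{11}| \cdot |\Sigma_{22}|}.$$

If $p_1 = p_2$, the measure is

(10) $$\frac{|\Sigma_{12}|^2}{|\Sigma_{11}| \cdot |\Sigma_{22}|}.$$


In a sense this measure shows how well $X^{(1)}$ can be predicted from $X^{(2)}$.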

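In the case of two scalar variables $X_1$ and $X_2$ the coefficient of alienation
is $\sigma_{1\cdot 2}^2/\sigma_1^2$, where $\sigma_{1\cdot 2}^2 = \mathcal{E}(X_1 - \beta X_2)^2$ is the variance of $X_1$ about its
regression on $X_2$ when $\mathcal{E}X_1 = \mathcal{E}X_2 = 0$ and $\mathcal{E}(X_1|X_2) = \beta X_2$. In the case of
two vectors $X^{(1)}$ and $X^{(2)}$, the regression matrix is $B = \Sigma_{12}\Sigma_{22}^{-1}$, and the
generalized variance of $X^{(1)}$ about its regression on $X^{(2)}$ is


(11) $$|\mathcal{E}(X^{(1)} - BX^{(2)})(X^{(1)} - BX^{(2)})'| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}| = \frac{|\Sigma|}{|\Sigma_{22}|}.$$

Since the generalized variance of $X^{(1)}$ is $|\mathcal{E}X^{(1)}X^{(1)\prime}| = |\Sigma_{11}|$, the vector
coefficient of alienation is

(12) $$\frac{|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|}{|\Sigma_{11}|} = \frac{|\Sigma|}{|\Sigma_{11}| \cdot |\Sigma_{22}|}.$$


The sample equivalent of (12) is simply V.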

A measure of association is 1 minus the coefficient of alienation. Either of
these two measures of association can be modified to take account of the
number of components. In the first case, one can take the $p_1$th root of (9); in
the second case, one can subtract the $p_1$th root of the coefficient of
alienation from 1. Another measure of association is


(13) $$\frac{\mathrm{tr}\,\mathcal{E}[BX^{(2)}(BX^{(2)})'](\mathcal{E}X^{(1)}X^{(1)\prime})^{-1}}{p_1}
= \frac{\mathrm{tr}\,\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}}{p_1}.$$

This measure of association ranges between 0 and 1. If $X^{(1)}$ can be predicted
exactly from $X^{(2)}$ for $p_1 \le p_2$ (i.e., $\Sigma_{11\cdot 2} = 0$), then this measure is 1. If no
linear combination of $X^{(1)}$ can be predicted exactly, this measure is 0.


9.9. ADMISSIBILITY OF THE LIKELIHOOD RATIO TEST


The admissibility of the likelihood ratio test in the case of the 0-1 loss 
function can be proved by showing that it is the Bayes procedure with respect 
to an appropriate a priori distribution of the parameters. (See Section 5.6.) 


Theorem 9.9.1. The likelihood ratio test of the hypothesis that $\Sigma$ is of the
form (6) of Section 9.2 is Bayes and admissible if N > p + 1.


Proof. We shall show that the likelihood ratio test is equivalent to rejection
of the hypothesis when


(1) $$\frac{\int f(x|\theta)\,\Pi_1(d\theta)}{\int f(x|\theta)\,\Pi_0(d\theta)} \ge c,$$


where x represents the sample, $\theta$ represents the parameters ($\mu$ and $\Sigma$),
$f(x|\theta)$ is the density, and $\Pi_1$ and $\Pi_0$ are proportional to probability measures
of $\theta$ under the alternative and null hypotheses, respectively. Specifically,
the left-hand side is to be proportional to the square root of $\prod_{i=1}^{q} |A_{ii}| / |A|$.

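To define $\Pi_1$, let


(2) $$\mu = (I + VV')^{-1}VY, \qquad \Sigma = (I + VV')^{-1},$$


where the p-component random vector V has the density proportional to
$(1 + v'v)^{-\frac{1}{2}n}$, n = N - 1, and the conditional distribution of Y given V = v is
$N[0, (1 + v'v)/N]$. Note that the integral of $(1 + v'v)^{-\frac{1}{2}n}$ is finite if n > p
(Problem 5.15). The numerator of (1) is then


(3) $$\begin{aligned}
\mathrm{const} \int_{-\infty}^{\infty} \int \cdots \int\ & |I + vv'|^{\frac{1}{2}N}
\exp\Bigl\{-\tfrac{1}{2} \sum_{\alpha=1}^{N} \bigl[x_\alpha - (I + vv')^{-1}vy\bigr]'(I + vv')\bigl[x_\alpha - (I + vv')^{-1}vy\bigr]\Bigr\} \\
& \cdot (1 + v'v)^{-\frac{1}{2}n}(1 + v'v)^{-\frac{1}{2}} \exp\Bigl[-\frac{Ny^2}{2(1 + v'v)}\Bigr]\, dv\, dy.
\end{aligned}$$

The exponent in the integrand of (3) is $-\frac{1}{2}$ times


(4) $$\begin{aligned}
\sum_{\alpha=1}^{N} x_\alpha'(I + vv')x_\alpha &- 2yv'\sum_{\alpha=1}^{N} x_\alpha + Ny^2 v'(I + vv')^{-1}v + \frac{Ny^2}{1 + v'v} \\
&= \sum_{\alpha=1}^{N} x_\alpha'x_\alpha + v'\sum_{\alpha=1}^{N} x_\alpha x_\alpha' v - 2yv'N\bar{x} + Ny^2 \\
&= \mathrm{tr}\,A + v'Av + N\bar{x}'\bar{x} + N(y - \bar{x}'v)^2,
\end{aligned}$$


where $A = \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N\bar{x}\bar{x}'$. We have used $v'(I + vv')^{-1}v + (1 + v'v)^{-1} = 1$
[from $(I + vv')^{-1} = I - (1 + v'v)^{-1}vv'$]. Using $|I + vv'| = 1 + v'v$ (Corollary
A.3.1), we write (3) as


(5) $$\mathrm{const}\ e^{-\frac{1}{2}\mathrm{tr}\,A - \frac{1}{2}N\bar{x}'\bar{x}} \int \cdots \int e^{-\frac{1}{2}v'Av} \int_{-\infty}^{\infty} e^{-\frac{1}{2}N(y - \bar{x}'v)^2}\, dy\, dv
= \mathrm{const}\ |A|^{-\frac{1}{2}} e^{-\frac{1}{2}\mathrm{tr}\,A - \frac{1}{2}N\bar{x}'\bar{x}}.$$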
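To define $\Pi_0$, let $\Sigma$ have the form of (6) of Section 9.2. Let

(6) $$[\mu^{(i)}, \Sigma_{ii}] = \bigl[(I + V^{(i)}V^{(i)\prime})^{-1}V^{(i)}Y_i,\ (I + V^{(i)}V^{(i)\prime})^{-1}\bigr], \qquad i = 1, \dots, q,$$

where the $p_i$-component random vector $V^{(i)}$ has density proportional to
$(1 + v^{(i)\prime}v^{(i)})^{-\frac{1}{2}n}$, and the conditional distribution of $Y_i$ given $V^{(i)} = v^{(i)}$ is
$N[0, (1 + v^{(i)\prime}v^{(i)})/N]$, and let $(V^{(1)}, Y_1), \dots, (V^{(q)}, Y_q)$ be mutually independent.
Then the denominator of (1) is


(7) $$\prod_{i=1}^{q} \mathrm{const}\ |A_{ii}|^{-\frac{1}{2}} \exp\bigl[-\tfrac{1}{2}(\mathrm{tr}\,A_{ii} + N\bar{x}^{(i)\prime}\bar{x}^{(i)})\bigr]
= \mathrm{const}\ \Bigl(\prod_{i=1}^{q} |A_{ii}|^{-\frac{1}{2}}\Bigr) \exp\bigl[-\tfrac{1}{2}(\mathrm{tr}\,A + N\bar{x}'\bar{x})\bigr].$$


The left-hand side of (1) is then proportional to the square root of
$\prod_{i=1}^{q} |A_{ii}| / |A|$. ∎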


This proof has been adapted from that of Kiefer and Schwartz (1965). 


9.10. MONOTONICITY OF POWER FUNCTIONS OF TESTS OF 
INDEPENDENCE OF SETS 


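Let $Z_\alpha = [Z_\alpha^{(1)\prime}, Z_\alpha^{(2)\prime}]'$, $\alpha = 1, \dots, n$, be distributed according to


(1) $$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right].$$

We want to test $H: \Sigma_{12} = 0$. We suppose $p_1 \le p_2$ without loss of generality.
Let $\rho_1, \dots, \rho_{p_1}$ ($\rho_1 \ge \cdots \ge \rho_{p_1}$) be the (population) canonical correlation
coefficients. (The $\rho_i^2$'s are the characteristic roots of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, Chapter 12.)
Let $R = \mathrm{diag}(\rho_1, \dots, \rho_{p_1})$ and $\Delta = [R, 0]$ ($p_1 \times p_2$).


Lemma 9.10.1. There exist matrices $B_1$ ($p_1 \times p_1$), $B_2$ ($p_2 \times p_2$) such that

(2) $$B_1\Sigma_{11}B_1' = I_{p_1}, \qquad B_2\Sigma_{22}B_2' = I_{p_2}, \qquad B_1\Sigma_{12}B_2' = \Delta.$$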

Proof. Let m =р,, В = Bj, Е'= Xi Bj, Е = Z5 X5? in Lemma 8.10.13. 
Then F'F =B} nB; =1,,, В, B, =ВЕЕР=А. м 


(This lemma is also contained in Section 12.2.) 

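Let $x_\alpha = B_1Z_\alpha^{(1)}$, $y_\alpha = B_2Z_\alpha^{(2)}$, $\alpha = 1, \dots, n$, and $X = (x_1, \dots, x_n)$, $Y = (y_1, \dots, y_n)$. Then $(x_\alpha', y_\alpha')'$, $\alpha = 1, \dots, n$, are independently distributed according to


(3) $$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} I_{p_1} & \Delta \\ \Delta' & I_{p_2} \end{pmatrix}\right].$$


The hypothesis $H: \Sigma_{12} = 0$ is equivalent to $H: \Delta = 0$ (i.e., all the canonical
correlation coefficients $\rho_1, \dots, \rho_{p_1}$ are zero). Now given Y, the vectors $x_\alpha$,
$\alpha = 1, \dots, n$, are conditionally independently distributed according to
$N(\Delta y_\alpha, I - \Delta\Delta') = N(\Delta y_\alpha, I - R^2)$. Then $x_\alpha^* = (I_{p_1} - R^2)^{-\frac{1}{2}}x_\alpha$ is distributed
according to $N(My_\alpha, I_{p_1})$ where


(4) $$M = (D, 0), \qquad D = \mathrm{diag}(\delta_1, \dots, \delta_{p_1}), \qquad \delta_i = \rho_i/(1 - \rho_i^2)^{\frac{1}{2}}, \quad i = 1, \dots, p_1.$$


Note that $\delta_i^2$ is a characteristic root of $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11\cdot 2}^{-1}$, where $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.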

Invariant tests depend only on the (sample) canonical correlation coefficients
$r_i = \sqrt{c_i}$, where


(5) $$c_i = \lambda_i\bigl[(X^*X^{*\prime})^{-1}(X^*Y')(YY')^{-1}(YX^{*\prime})\bigr].$$

Let

(6) $$S_h = X^*Y'(YY')^{-1}YX^{*\prime}, \qquad
S_e = X^*X^{*\prime} - S_h = X^*\bigl[I - Y'(YY')^{-1}Y\bigr]X^{*\prime}.$$

Then


(7) | м(85:') = 1. 


Now given Y, the problem reduces to the MANOVA problem and we can 
apply Theorem 8.10.6 as follows. There is an orthogonal transformation 
(Section 8.3.3) that carries X* to (U,V) such that S, = UU', S,—YV', 
U—-(,...,u,,, V is p, X(n—p;), и; has the distribution Ме, Г), 
i — 1,..., p, Ce, being the ith column of Г), and N(0, D), i - p,  1,..., pos 
and the columns of V are independently distributed according to N(0, Г). 
Then сь...›6», are the characteristic roots of UU'(VV")^!, and their distri- 
bution depends on the characteristic roots of MYY'M', say, т?,...,т2. Now 
from Thcorem 8.10.6, we obtain the following lemma. 


Lemma 9.10.2. If the acceptance region of an invariant test is convex in
each column of U, given V and the other columns of U, then the conditional
power given Y increases in each characteristic root $\tau_i^2$ of $MYY'M'$.


Lemma 9.10.3. If $A \ge B$, then $\lambda_i(A) \ge \lambda_i(B)$.


Proof. By the minimax property of the characteristic roots [see, e.g.,
Courant and Hilbert (1953)],


(8) $$\lambda_i(A) = \max_{S_i} \min_{x \in S_i} \frac{x'Ax}{x'x} \ge \max_{S_i} \min_{x \in S_i} \frac{x'Bx}{x'x} = \lambda_i(B),$$


where $S_i$ ranges over i-dimensional subspaces. ∎

Now Lemma 9.10.3 applied to $MYY'M'$ shows that for every j, $\tau_j^2$ is an
increasing function of $\delta_i = \rho_i/(1 - \rho_i^2)^{\frac{1}{2}}$ and hence of $\rho_i$. Since the marginal
distribution of Y does not depend on the $\rho_i$'s, by taking the unconditional
power we obtain the following theorem.


Theorem 9.10.1. An invariant test for which the acceptance region is convex
in each column of U for each set of fixed V and other columns of U has a power
function that is monotonically increasing in each $\rho_i$.


9.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


9.11.1. Observations Elliptically Contoured 
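Let $x_1, \dots, x_N$ be N observations on a random vector X with density


(1) $$|\Lambda|^{-\frac{1}{2}} g[(x - \nu)'\Lambda^{-1}(x - \nu)],$$

where $\mathcal{E}R^4 < \infty$ and $R^2 = (X - \nu)'\Lambda^{-1}(X - \nu)$. Then $\mathcal{E}X = \nu$ and $\mathcal{E}(X - \nu)(X - \nu)' = \Sigma = (\mathcal{E}R^2/p)\Lambda$. Let


(2) $$\bar{x} = \frac{1}{N}\sum_{\alpha=1}^{N} x_\alpha, \qquad
S = \frac{1}{N - 1}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'.$$


Then


(3) $$\sqrt{N}\,\mathrm{vec}(S - \Sigma) \xrightarrow{d} N\bigl[0,\ (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\,\mathrm{vec}\,\Sigma\,(\mathrm{vec}\,\Sigma)'\bigr],$$


where $1 + \kappa = p\,\mathcal{E}R^4/[(p + 2)(\mathcal{E}R^2)^2]$.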
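
To see how the kurtosis parameter $\kappa$ behaves, the defining relation $1 + \kappa = p\,\mathcal{E}R^4/[(p+2)(\mathcal{E}R^2)^2]$ can be evaluated for two familiar cases. The sketch below is illustrative only; the function names are hypothetical, and the moments used are the standard ones for a normal vector ($R^2 \sim \chi_p^2$) and for a multivariate t with $m > 4$ degrees of freedom (where $R^2$ is distributed as $\chi_p^2/(\chi_m^2/m)$ with independent numerator and denominator).

```python
def kurtosis_parameter(p, er2, er4):
    """kappa from 1 + kappa = p * E[R^4] / ((p + 2) * (E[R^2])**2)."""
    return p * er4 / ((p + 2) * er2 ** 2) - 1.0

def kappa_normal(p):
    # Normal case: R^2 ~ chi-square with p d.f., so E R^2 = p, E R^4 = p(p + 2).
    return kurtosis_parameter(p, p, p * (p + 2))

def kappa_t(p, m):
    # Multivariate t with m > 4 d.f.: R^2 = chi2_p / (chi2_m / m).
    er2 = p * m / (m - 2)
    er4 = p * (p + 2) * m ** 2 / ((m - 2) * (m - 4))
    return kurtosis_parameter(p, er2, er4)

print(kappa_normal(3))    # normal case
print(kappa_t(3, 10))     # heavier tails give a positive kappa
```

In the normal case $\kappa$ vanishes and (3) reduces to the usual limiting covariance; for the t-distribution the value works out to $2/(m - 4)$, independent of p.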

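The likelihood ratio criterion for testing the null hypothesis $\Sigma_{ij} = 0$, $i \ne j$,
is the N/2th power of $V = \prod_{i=2}^{q} V_i$, where $V_i$ is the U-criterion for testing the
null hypothesis $\Sigma_{i1} = 0, \dots, \Sigma_{i,i-1} = 0$ and is given by (1) and (6) of Section
9.3. The form of $V_i$ is that of the likelihood ratio criterion U of Chapter 8
with X replaced by $X^{(i)}$, $\beta$ by $\beta_i$ given by (5) of Section 9.3, Z by

(4) $$\bar{X}^{(i-1)} = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(i-1)} \end{pmatrix},$$

and $\Sigma$ by $\Sigma_{ii}$ under the null hypothesis $\beta_i = 0$. The subvector $\bar{X}^{(i-1)}$ is
uncorrelated with $X^{(i)}$, but not independent of $X^{(i)}$ unless $(\bar{X}^{(i-1)\prime}, X^{(i)\prime})'$ is
normal. Let


(5) $$A^{(i-1)} = \begin{pmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{pmatrix},$$

(6) $$A^{(i,i-1)} = (A_{i1}, \dots, A_{i,i-1}) = A^{(i-1,i)\prime},$$


with similar definitions of $\Sigma^{(i-1)}$, $\Sigma^{(i,i-1)}$, $S^{(i-1)}$, and $S^{(i,i-1)}$. We write
$V_i = |G_i| / |G_i + H_i|$, where


(7) $$H_i = A^{(i,i-1)}(A^{(i-1)})^{-1}A^{(i-1,i)} = (N - 1)\,S^{(i,i-1)}(S^{(i-1)})^{-1}S^{(i-1,i)},$$

(8) $$G_i = A_{ii} - H_i = (N - 1)S_{ii} - H_i.$$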


Theorem 9.11.1. When X has the density (1) and the null hypothesis is true,
the limiting distribution of $H_i$ is $W[(1 + \kappa)\Sigma_{ii}, \bar{p}_i]$, where $\bar{p}_i = p_1 + \cdots + p_{i-1}$
and $p_i$ is the number of components of $X^{(i)}$.
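Proof. Since $\Sigma^{(i,i-1)} = 0$, we have $\mathcal{E}S^{(i,i-1)} = 0$ and


(9) $$\mathcal{E}s_{jk}s_{lm} = \Bigl(\frac{\kappa}{N} + \frac{1}{n}\Bigr)\sigma_{jl}\sigma_{km}$$


if $j, l \le \bar{p}_i$ and $k, m > \bar{p}_i$ or if $j, l > \bar{p}_i$ and $k, m \le \bar{p}_i$, and $\mathcal{E}s_{jk}s_{lm} = 0$ otherwise
(Theorem 3.6.1). We can write


(10) $$\mathcal{E}\,\mathrm{vec}\,S^{(i,i-1)}(\mathrm{vec}\,S^{(i,i-1)})' = \Bigl(\frac{\kappa}{N} + \frac{1}{n}\Bigr)(\Sigma^{(i-1)} \otimes \Sigma_{ii}).$$


Since $S^{(i-1)} \xrightarrow{p} \Sigma^{(i-1)}$ and $\sqrt{N}\,\mathrm{vec}\,S^{(i,i-1)}$ has a limiting normal distribution,
Theorem 9.11.1 follows by (2) of Section 8.4. ∎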


Theorem 9.11.2. Under the conditions of Theorem 9.11.1 when the null
hypothesis is true


(11) $$-N \log V_i \xrightarrow{d} (1 + \kappa)\chi_{\bar{p}_i p_i}^2.$$


Proof. We can write V, = |I + N^' C;G))! Hj| and use N log| + N^! C| = 
tr C +0,(N7') and 


1 -1 Bj.) Pi 
(12) “(с Ho-N У, X gis, 58" Sip 
ijep,tlg,h-l 


-1 


= N(vec sta) e $;! vec SLD, " 


Because $X^{(i)}$ is uncorrelated with $\bar{X}^{(i-1)}$ when the null hypothesis
$H_i: \Sigma^{(i,i-1)} = 0$ is true, $V_i$ is asymptotically independent of $V_2, \dots, V_{i-1}$. When the
null hypotheses $H_2, \dots, H_i$ are true, $V_i$ is asymptotically independent of
$V_2, \dots, V_{i-1}$. It follows from Theorem 9.11.2 that

(13) $$-N \log V = -N \sum_{i=2}^{q} \log V_i \xrightarrow{d} (1 + \kappa)\chi_f^2,$$


where $f = \sum_{i=2}^{q} \bar{p}_i p_i = \frac{1}{2}\bigl[p(p + 1) - \sum_{i=1}^{q} p_i(p_i + 1)\bigr]$. The likelihood ratio test
of Section 9.2 can be carried out on an asymptotic basis.

Let $A_0 = \mathrm{diag}(A_{11}, \dots, A_{qq})$. Then


8 
(14) lu(A4;! - I) =} X t A; AS MAS 


has the x/-distribution when X = diag(3,,,..., X,,). The step-down proce- 
dure of Section 9.6.1 is also justified on an asymptotic basis. i 


9.11.2. Elliptically Contoured Matrix Distributions 


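Let Y ($p \times N$) have the density $g(\mathrm{tr}\,YY')$. The matrix Y is vector-spherical;
that is, vec Y is spherical and has the stochastic representation $\mathrm{vec}\,Y = R\,\mathrm{vec}\,U_{p \times N}$, where $R^2 = (\mathrm{vec}\,Y)'\,\mathrm{vec}\,Y = \mathrm{tr}\,YY'$ and $\mathrm{vec}\,U_{p \times N}$ has the uniform
distribution on the unit sphere $(\mathrm{vec}\,U_{p \times N})'\,\mathrm{vec}\,U_{p \times N} = 1$. (We use the notation
$U_{p \times N}$ to distinguish from U uniform on the space $UU' = I_p$.)

Let


(15) $$X = \nu\varepsilon_N' + CY,$$


where $\Lambda = CC'$ and C is lower triangular. Then X has the density


(16) $$|\Lambda|^{-N/2} g\bigl[\mathrm{tr}\,C^{-1}(X - \nu\varepsilon_N')(X' - \varepsilon_N\nu')(C')^{-1}\bigr]
= |\Lambda|^{-N/2} g\bigl[\mathrm{tr}\,(X' - \varepsilon_N\nu')\Lambda^{-1}(X - \nu\varepsilon_N')\bigr].$$


Consider the null hypothesis $\Sigma_{ij} = 0$, $i \ne j$, or alternatively $\Lambda_{ij} = 0$, $i \ne j$, or
alternatively, $R_{ij} = 0$, $i \ne j$. Then $C = \mathrm{diag}(C_{11}, \dots, C_{qq})$.

Let $M = I_N - (1/N)\varepsilon_N\varepsilon_N'$; since $M^2 = M$, M is an idempotent matrix
with N - 1 characteristic roots 1 and one root 0. Then $A = XMX'$ and
$A_{ii} = X^{(i)}MX^{(i)\prime}$. The likelihood function is


(17) $$|\Lambda|^{-N/2} g\bigl\{\mathrm{tr}\,\Lambda^{-1}[A + N(\bar{x} - \nu)(\bar{x} - \nu)']\bigr\}.$$

The matrix A and the vector $\bar{x}$ are sufficient statistics, and the likelihood
ratio criterion for the hypothesis H is $(|A| / \prod_{i=1}^{q} |A_{ii}|)^{N/2}$, the same as for
normality. See Anderson and Fang (1990b).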


Theorem 9.11.3. Let f(X) be a vector-valued function of X ($p \times N$) such
that


(18) $$f(X + \nu\varepsilon_N') = f(X)$$

for all $\nu$ and

(19) $$f(KX) = f(X)$$


for all $K = \mathrm{diag}(K_{11}, \dots, K_{qq})$. Then the distribution of f(X), where X has the
arbitrary density (16), is the same as the distribution of f(X), where X has the
normal density (16).


Proof. The proof is similar to the proof of Theorem 4.5.4. ∎


It follows from Theorem 9.11.3 that V has the same distribution under the
null hypothesis H when X has the density (16) and for X normally distributed
since V is invariant under the transformation $X \to KX$. Similarly, $V_i$
and the criterion (14) are invariant, and hence have the distribution under
normality.


PROBLEMS 
9.1. (Sec. 9.3) Prove 
gy» = Hiro «1-0 «40e (nigro *1-)]) 
IET 1-0] (nario «1-7 +A)} 
by integration of V^w(A| Xs, n). Hint: Show 


K(Xy,n) 


g 
h> - 
ёи = Kn tony) f H4 "w(A, Xy, n + 2h) dA, 


where K(Z,n) is defined by w(A| E, n) = K(X ka-p- p- i X-!4 
Theorem 7.3.5 to show у (2, AI е? . Use 


EV" = 


Kon) [KC +2h) 
K(Xo,n + 2h) I| Eom 


I J= for rm) аа, 
9.2. (Sec. 9.3) Prove that if p, =p, = p, = 1 [Wilks (1935)] 
Ри <o} = 1[3(n—1),3] + 28^ [3(n ^ 1), в Vi -v. 


[Hint: Use Theorem 9.3.3 and $\Pr\{V \le v\} = 1 - \Pr\{v \le V\}$.]

9.3. (Sec. 9.3) Prove that if $p_1 = p_2 = p_3 = 2$ [Wilks (1935)]

РКИ < и} =I p(n- 5,4)
+ B^ (n — 5,4079 (n/6 - 3(n - Do - $(n - 9v
(Zn - 5)
~3(n— 2)v log u — 1(п- 3)v7^ log v).
[Hint: Use (9).]

9.4. (Sec. 9.3) Derive some of the distributions obtained by Wilks (1935) and
referred to at the end of Section 9.3.3. [Hint: In addition to the results for
Problems 9.2 and 9.3, use those of Section 9.3.2.]

9.5. (Sec. 9.4) For the case $p_i = 2$, express k and $\gamma_2$. Compute the second term of
(6) when v is chosen so that the first term is 0.95 for p = 4 and 6 and N = 15.

9.6. (Sec. 9.5) Prove that if $BAB' = CAC' = I$ for A positive definite and B and C
nonsingular then B = QC where Q is orthogonal.

9.7. (Sec. 9.5) Prove N times (2) has a limiting $\chi^2$-distribution with f degrees of
freedom under the null hypothesis.

9.8. (Sec. 9.8) Give the sample vector coefficient of alienation and the vector
correlation coefficient.

9.9. (Sec. 9.8) If y is the sample vector coefficient of alienation and z the square
of the vector correlation coefficient, find $\mathcal{E}y^g z^h$ when $\Sigma_{12} = 0$.

9.10. (Sec. 9.9) Prove

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{1}{\bigl(1 + \sum_{i=1}^{p} y_i^2\bigr)^{\frac{1}{2}n}}\, dy_1 \cdots dy_p < \infty$$

if p < n. [Hint: Let $y_j = w_j\bigl(1 + \sum_{i=j+1}^{p} y_i^2\bigr)^{\frac{1}{2}}$, $j = 1, \dots, p$, in turn.]

9.11. Let $x_1$ = arithmetic speed, $x_2$ = arithmetic power, $x_3$ = intellectual interest,
$x_4$ = social interest, $x_5$ = activity interest. Kelley (1928) observed the following
correlations between batteries of tests identified as above, based on 109 pupils:

$$\begin{pmatrix}
1.0000 & 0.4249 & -0.0552 & -0.0031 & 0.1927 \\
0.4249 & 1.0000 & -0.0416 & 0.0495 & 0.0687 \\
-0.0552 & -0.0416 & 1.0000 & 0.7474 & 0.1691 \\
-0.0031 & 0.0495 & 0.7474 & 1.0000 & 0.2653 \\
0.1927 & 0.0687 & 0.1691 & 0.2653 & 1.0000
\end{pmatrix}$$

Let $x^{(1)\prime} = (x_1, x_2)$ and $x^{(2)\prime} = (x_3, x_4, x_5)$. Test the hypothesis that $x^{(1)}$ is
independent of $x^{(2)}$ at the 1% significance level.


9.12. Carry out the same exercise on the data in Problem 3.42. 


9.13. Another set of time-study data [Abruzzi (1950)] is summarized by the correlation
matrix based on 188 observations:

$$\begin{pmatrix}
1.00 & -0.27 & 0.06 & 0.07 & 0.02 \\
-0.27 & 1.00 & -0.01 & -0.02 & -0.02 \\
0.06 & -0.01 & 1.00 & -0.07 & -0.04 \\
0.07 & -0.02 & -0.07 & 1.00 & -0.10 \\
0.02 & -0.02 & -0.04 & -0.10 & 1.00
\end{pmatrix}$$

Test the hypothesis that $\sigma_{ij} = 0$, $i \ne j$, at the 5% significance level.


CHAPTER 10 


Testing Hypotheses of Equality 
of Covariance Matrices and 
Equality of Mean Vectors and 
Covariance Matrices 


10.1. INTRODUCTION 


In this chapter we study the problems of testing hypotheses of equality of 
covariance matrices and equality of both covariance matrices and mean 
vectors. In each case (except one) the problem and tests considered are 
multivariate generalizations of a univariate problem and test. Many of the 
tests are likelihood ratio tests or modifications of likelihood ratio tests. 
Invariance considerations lead to other test procedures. 

First, we consider equality of covariance matrices and equality of covari- 
ance matrices and mean vectors of several populations without specifying the 
common covariance matrix or the common covariance matrix and mean 
vector. The multivariate analysis of variance with random factors is consid- 
ered in this context. Later we treat the equality of a covariance matrix to a 
given matrix and also simultaneous equality of a covariance matrix to a given 
matrix and equality of a mean vector to a given vector. One other hypothesis 
considered, the equality of a covariance matrix to a given matrix except for a 
proportionality factor, has only a trivial corresponding univariate hypothesis. 

In each case the class of tests for a class of hypotheses leads to a 
confidence region. Families of simultaneous confidence intervals for covari- 
ances and for ratios of covariances are given. 


The application of the tests for elliptically contoured distributions is
treated in Section 10.11.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.


10.2. CRITERIA FOR TESTING EQUALITY OF SEVERAL
COVARIANCE MATRICES


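In this section we study several normal distributions and consider using a set
of samples, one from each population, to test the hypothesis that the
covariance matrices of these populations are equal. Let $x_\alpha^{(g)}$, $\alpha = 1, \dots, N_g$,
$g = 1, \dots, q$, be an observation from the gth population $N(\mu^{(g)}, \Sigma_g)$. We wish
to test the hypothesis


(1) $$H_1: \Sigma_1 = \cdots = \Sigma_q.$$

Let $\sum_{g=1}^{q} N_g = N$,

(2) $$A_g = \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \bar{x}^{(g)})(x_\alpha^{(g)} - \bar{x}^{(g)})', \quad g = 1, \dots, q, \qquad
A = \sum_{g=1}^{q} A_g.$$


First we shall obtain the likelihood ratio criterion. The likelihood function is


(3) $$L = \prod_{g=1}^{q} \frac{1}{(2\pi)^{\frac{1}{2}pN_g}|\Sigma_g|^{\frac{1}{2}N_g}}
\exp\Bigl[-\tfrac{1}{2}\sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \mu^{(g)})'\Sigma_g^{-1}(x_\alpha^{(g)} - \mu^{(g)})\Bigr].$$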


ww 


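The space $\Omega$ is the parameter space in which each $\Sigma_g$ is positive definite and
$\mu^{(g)}$ any vector. The space $\omega$ is the parameter space in which $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q$ (positive definite) and $\mu^{(g)}$ is any vector. The maximum likelihood
estimators of $\mu^{(g)}$ and $\Sigma_g$ in $\Omega$ are given by


(4) $$\hat{\mu}_\Omega^{(g)} = \bar{x}^{(g)}, \qquad \hat{\Sigma}_{g\Omega} = \frac{1}{N_g}A_g, \qquad g = 1, \dots, q.$$


The maximum likelihood estimators of $\mu^{(g)}$ in $\omega$ are given by (4), $\hat{\mu}_\omega^{(g)} = \bar{x}^{(g)}$,
since the maximizing values of $\mu^{(g)}$ are the same regardless of $\Sigma_g$. The
function to be maximized with respect to $\Sigma_1 = \cdots = \Sigma_q = \Sigma$, say, is

(5) $$\frac{1}{(2\pi)^{\frac{1}{2}pN}|\Sigma|^{\frac{1}{2}N}}
\exp\Bigl[-\tfrac{1}{2}\sum_{g=1}^{q}\sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \bar{x}^{(g)})'\Sigma^{-1}(x_\alpha^{(g)} - \bar{x}^{(g)})\Bigr].$$

By Lemma 3.2.2, the maximizing value of $\Sigma$ is


(6) $$\hat{\Sigma}_\omega = \frac{1}{N}A,$$


and the maximum of the likelihood function is


(7) $$\frac{1}{(2\pi)^{\frac{1}{2}pN}|\hat{\Sigma}_\omega|^{\frac{1}{2}N}}\, e^{-\frac{1}{2}pN}.$$


The likelihood ratio criterion for testing (1) is


(8) $$\lambda_1 = \frac{\prod_{g=1}^{q} |\hat{\Sigma}_{g\Omega}|^{\frac{1}{2}N_g}}{|\hat{\Sigma}_\omega|^{\frac{1}{2}N}}
= \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{|A|^{\frac{1}{2}N}} \cdot \frac{N^{\frac{1}{2}pN}}{\prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}}.$$


The critical region is


(9) $$\lambda_1 \le \lambda_1(\varepsilon),$$


where $\lambda_1(\varepsilon)$ is defined so that (9) holds with probability $\varepsilon$ when (1) is true.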


Bartlett (1937a) has suggested modifying $\lambda_1$ in the univariate case by
replacing sample numbers by the numbers of degrees of freedom of the $A_g$.
Except for a numerical constant, the statistic he proposes is


(10) $$V_1 = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}n_g}}{|A|^{\frac{1}{2}n}},$$


where $n_g = N_g - 1$ and $n = \sum_{g=1}^{q} n_g = N - q$. The numerator is proportional
to a power of a weighted geometric mean of the sample generalized variances,
and the denominator is proportional to a power of the determinant of
a weighted arithmetic mean of the sample covariance matrices.


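In the scalar case (p = 1) of two samples the criterion (10) is


(11) $$V_1 = \frac{(n_1s_1^2)^{\frac{1}{2}n_1}(n_2s_2^2)^{\frac{1}{2}n_2}}{(n_1s_1^2 + n_2s_2^2)^{\frac{1}{2}n}}
= \frac{n_1^{\frac{1}{2}n_1}n_2^{\frac{1}{2}n_2}F^{\frac{1}{2}n_1}}{(n_1F + n_2)^{\frac{1}{2}n}},$$


where $s_1^2$ and $s_2^2$ are the usual unbiased estimators of $\sigma_1^2$ and $\sigma_2^2$ (the two
population variances) and


(12) $$F = \frac{s_1^2}{s_2^2}.$$

Thus the critical region

(13) $$V_1 \le V_1(\varepsilon)$$


is based on the F-statistic with $n_1$ and $n_2$ degrees of freedom, and the
inequality (13) implies a particular method of choosing $F_1(\varepsilon)$ and $F_2(\varepsilon)$ for
the critical region


(14) $$F \le F_1(\varepsilon), \qquad F \ge F_2(\varepsilon).$$


Brown (1939) and Scheffé (1942) have shown that (14) yields an unbiased
test.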
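
The two expressions for $V_1$ in (11) can be compared on data. The following sketch is illustrative only: the two samples are made-up numbers and the function name `bartlett_v1_scalar` is hypothetical.

```python
import math

def bartlett_v1_scalar(x1, x2):
    """Return (V1 from sums of squares, F, V1 rewritten in terms of F) for p = 1."""
    n1, n2 = len(x1) - 1, len(x2) - 1
    n = n1 + n2
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    a1 = sum((x - m1) ** 2 for x in x1)    # A_1 = n_1 * s_1^2
    a2 = sum((x - m2) ** 2 for x in x2)    # A_2 = n_2 * s_2^2
    v1 = a1 ** (n1 / 2) * a2 ** (n2 / 2) / (a1 + a2) ** (n / 2)
    f = (a1 / n1) / (a2 / n2)              # F = s_1^2 / s_2^2
    v1_from_f = (n1 ** (n1 / 2) * n2 ** (n2 / 2) * f ** (n1 / 2)
                 / (n1 * f + n2) ** (n / 2))
    return v1, f, v1_from_f

# Hypothetical samples from two populations.
sample1 = [9.8, 10.4, 10.1, 9.5, 10.9]
sample2 = [10.2, 11.9, 8.7, 9.1, 12.3, 10.6]
v1, f, v1_from_f = bartlett_v1_scalar(sample1, sample2)
print(v1, f, v1_from_f)
```

Since $V_1$ depends on the data only through F, small values of $V_1$ correspond to F being either small or large, which is why (13) leads to the two-sided region (14).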

Bartlett gave a more intuitive argument for the use of $V_1$ in place of $\lambda_1$.
He argues that if $N_1$, say, is small, $A_1$ is given too much weight in $\lambda_1$, and
other effects may be missed. Perlman (1980) has shown that the test based on
$V_1$ is unbiased.

If one assumes


(15) $$\mathcal{E}X_\alpha^{(g)} = B_g z_\alpha^{(g)},$$

where $z_\alpha^{(g)}$ consists of $k_g$ components, and if one estimates the matrix $B_g$,
defining


(16) $$A_g = \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \hat{B}_g z_\alpha^{(g)})(x_\alpha^{(g)} - \hat{B}_g z_\alpha^{(g)})',$$


one uses (10) with $n_g = N_g - k_g$.

The statistical problem (parameter space $\Omega$ and null hypothesis $\omega$) is
invariant with respect to changes of location within populations and a
common linear transformation


(17) $$x_\alpha^{*(g)} = Cx_\alpha^{(g)} + \nu^{(g)}, \qquad g = 1, \dots, q,$$


where C is nonsingular. Each matrix $A_g$ is invariant under change of
location, and the modified criterion (10) is invariant:


(18) $$V_1^* = \frac{\prod_{g=1}^{q} |CA_gC'|^{\frac{1}{2}n_g}}{|CAC'|^{\frac{1}{2}n}}
= \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}n_g}}{|A|^{\frac{1}{2}n}} = V_1.$$


Similarly, the likelihood ratio criterion (8) is invariant.

An alternative invariant test procedure [Nagao (1973a)] is based on the
criterion


(19) $$\frac{1}{2}\sum_{g=1}^{q} n_g\,\mathrm{tr}(S_gS^{-1} - I)^2 = \frac{1}{2}\sum_{g=1}^{q} n_g\,\mathrm{tr}(S_g - S)S^{-1}(S_g - S)S^{-1},$$


where $S_g = (1/n_g)A_g$ and $S = (1/n)A$. (See Section 7.8.)


10.3. CRITERIA FOR TESTING THAT SEVERAL NORMAL 
DISTRIBUTIONS ARE IDENTICAL 


In Section 8.8 we considered testing the equality of mean vectors when we
assumed the covariance matrices were the same; that is, we tested


(1) $$H_2: \mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(q)} \quad \text{given } \Sigma_1 = \Sigma_2 = \cdots = \Sigma_q.$$


The test of the assumption in $H_2$ was considered in Section 10.2. Now let us
consider the hypothesis that both means and covariances are the same; this is
a combination of $H_1$ and $H_2$. We test


(2) $$H: \mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(q)}, \qquad \Sigma_1 = \Sigma_2 = \cdots = \Sigma_q.$$


As in Section 10.2, let $x_\alpha^{(g)}$, $\alpha = 1, \dots, N_g$, be an observation from $N(\mu^{(g)}, \Sigma_g)$,
$g = 1, \dots, q$. Then $\Omega$ is the unrestricted parameter space of $\{\mu^{(g)}, \Sigma_g\}$, $g = 1, \dots, q$, where $\Sigma_g$ is positive definite, and $\omega^*$ consists of the space restricted
by (2).

The likelihood function is given by (3) of Section 10.2. The hypothesis $H_1$
of Section 10.2 is that the parameter point falls in $\omega$; the hypothesis $H_2$ of
Section 8.8 is that the parameter point falls in $\omega^*$ given it falls in $\omega \supset \omega^*$;
and the hypothesis H here is that the parameter point falls in $\omega^*$ given that
it is in $\Omega$.

We use the following lemma: 


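Lemma 10.3.1. Let y be an observation vector on a random vector with
density $f(z, \theta)$, where $\theta$ is a parameter vector in a space $\Omega$. Let $H_a$ be the
hypothesis $\theta \in \Omega_a \subset \Omega$, let $H_b$ be the hypothesis $\theta \in \Omega_b \subset \Omega_a$, given $\theta \in \Omega_a$,
and let $H_{ab}$ be the hypothesis $\theta \in \Omega_b$, given $\theta \in \Omega$. If $\lambda_a$, the likelihood ratio
criterion for testing $H_a$, $\lambda_b$ for $H_b$, and $\lambda_{ab}$ for $H_{ab}$ are uniquely defined for the
observation vector y, then


(3) $$\lambda_{ab} = \lambda_a\lambda_b.$$

Proof. The lemma follows from the definitions:


(4) $$\lambda_a = \frac{\max_{\theta \in \Omega_a} f(y, \theta)}{\max_{\theta \in \Omega} f(y, \theta)},$$

(5) $$\lambda_b = \frac{\max_{\theta \in \Omega_b} f(y, \theta)}{\max_{\theta \in \Omega_a} f(y, \theta)},$$

(6) $$\lambda_{ab} = \frac{\max_{\theta \in \Omega_b} f(y, \theta)}{\max_{\theta \in \Omega} f(y, \theta)}.$$ ∎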


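Thus the likelihood ratio criterion for the hypothesis $H$ is the product of
the likelihood ratio criteria for $H_1$ and $H_2$,


(7)  $\lambda = \lambda_1 \lambda_2 = \dfrac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{\prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}} \cdot \dfrac{N^{\frac{1}{2}pN}}{|B|^{\frac{1}{2}N}},$

where

(8)  $B = \sum_{g=1}^{q} \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \bar{x})(x_\alpha^{(g)} - \bar{x})' = A + \sum_{g=1}^{q} N_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'.$


The critical region is defined by

(9)  $\lambda \le \lambda(\varepsilon),$

where $\lambda(\varepsilon)$ is chosen so that the probability of (9) under $H$ is $\varepsilon$.

Let

(10)  $V_2 = \dfrac{|A|^{\frac{1}{2}n}}{|B|^{\frac{1}{2}n}};$

this is equivalent to $\lambda_2$ for testing $H_2$, which is $\lambda$ of (12) of Section 8.8. We
might consider

(11)  $W = V_1 V_2 = \dfrac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}n_g}}{|B|^{\frac{1}{2}n}}.$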


However, Perlman (1980) has shown that the likelihood ratio test is unbiased. 
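
The relations (8) and (11) can be checked numerically. The following is a minimal sketch, assuming NumPy and assuming each sample is stored as an $N_g \times p$ array; the helper names `scatter_matrices` and `log_criteria` are illustrative and not from the text. Logarithms are used because the determinants raised to powers of order $n$ can overflow.

```python
import numpy as np

def scatter_matrices(samples):
    """Return ([A_1, ..., A_q], A, B) for a list of (N_g x p) sample arrays."""
    A_list = []
    for X in samples:
        Xc = X - X.mean(axis=0)          # deviations from the group mean
        A_list.append(Xc.T @ Xc)
    A = sum(A_list)
    pooled = np.vstack(samples)
    Pc = pooled - pooled.mean(axis=0)    # deviations from the overall mean
    B = Pc.T @ Pc
    return A_list, A, B

def log_criteria(samples):
    """Return (log V_1, log V_2, log W) with n_g = N_g - 1 and n = sum of n_g."""
    A_list, A, B = scatter_matrices(samples)
    n_list = [X.shape[0] - 1 for X in samples]
    n = sum(n_list)

    def logdet(M):
        return np.linalg.slogdet(M)[1]

    sum_groups = sum(0.5 * n_g * logdet(A_g) for A_g, n_g in zip(A_list, n_list))
    log_V1 = sum_groups - 0.5 * n * logdet(A)
    log_V2 = 0.5 * n * (logdet(A) - logdet(B))
    log_W = sum_groups - 0.5 * n * logdet(B)
    return log_V1, log_V2, log_W
```

For any data set the returned values should satisfy $\log W = \log V_1 + \log V_2$, and $B$ should equal $A$ plus the between-group term in (8).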


10.4. DISTRIBUTIONS OF THE CRITERIA


10.4.1. Characterization of the Distributions 


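First let us consider $V_1$ given by (10) of Section 10.2. If


(1)  $V_{1g} = \dfrac{|A_1 + \cdots + A_{g-1}|^{\frac{1}{2}(n_1 + \cdots + n_{g-1})} |A_g|^{\frac{1}{2}n_g}}{|A_1 + \cdots + A_g|^{\frac{1}{2}(n_1 + \cdots + n_g)}}, \qquad g = 2,\ldots,q,$

then

(2)  $V_1 = \prod_{g=2}^{q} V_{1g}.$


Theorem 10.4.1. $V_{12}, V_{13}, \ldots, V_{1q}$ defined by (1) are independent when
$\Sigma_1 = \cdots = \Sigma_q$ and $n_g \ge p$, $g = 1,\ldots,q$.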


The theorem is a consequence of the following lemma: 


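Lemma 10.4.1. If $A$ and $B$ are independently distributed according to
$W(\Sigma, m)$ and $W(\Sigma, n)$, respectively, $n \ge p$, $m \ge p$, and $C$ is such that
$C(A + B)C' = I$, then $A + B$ and $CAC'$ are independently distributed; $A + B$ has the
Wishart distribution with $m + n$ degrees of freedom, and $CAC'$ has the multivariate
beta distribution with $n$ and $m$ degrees of freedom.


Proof of Lemma. The density of $D = A + B$ and $E = CAC'$ is found by
replacing $A$ and $B$ in their joint density by $C^{-1}EC'^{-1}$ and $D - C^{-1}EC'^{-1} =
C^{-1}(I - E)C'^{-1}$, respectively, and multiplying by the Jacobian, which is
$\operatorname{mod}|C|^{-(p+1)} = |D|^{\frac{1}{2}(p+1)}$, to obtain


(3)  $K(\Sigma, m)K(\Sigma, n)\,|C^{-1}EC'^{-1}|^{\frac{1}{2}(m-p-1)}\,|C^{-1}(I - E)C'^{-1}|^{\frac{1}{2}(n-p-1)}\, e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}D}\,|D|^{\frac{1}{2}(p+1)}$

     $= K(\Sigma, m + n)\,|D|^{\frac{1}{2}(m+n-p-1)}\, e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}D} \cdot \dfrac{\Gamma_p[\frac{1}{2}(m + n)]}{\Gamma_p(\frac{1}{2}m)\,\Gamma_p(\frac{1}{2}n)}\,|E|^{\frac{1}{2}(m-p-1)}\,|I - E|^{\frac{1}{2}(n-p-1)}$


for $D$, $E$, and $I - E$ positive definite. ∎

Proof of Theorem. If we let $A_1 + \cdots + A_g = D_g$ and $C_g(A_1 + \cdots + A_{g-1})C_g'
= E_g$, where $C_gD_gC_g' = I$, $g = 2,\ldots,q$, then


(4)  $V_{1g} = \dfrac{|C_g^{-1}E_gC_g'^{-1}|^{\frac{1}{2}(n_1 + \cdots + n_{g-1})}\,|C_g^{-1}(I - E_g)C_g'^{-1}|^{\frac{1}{2}n_g}}{|C_g^{-1}C_g'^{-1}|^{\frac{1}{2}(n_1 + \cdots + n_g)}}$

     $= |E_g|^{\frac{1}{2}(n_1 + \cdots + n_{g-1})}\,|I - E_g|^{\frac{1}{2}n_g}, \qquad g = 2,\ldots,q,$

and $E_2,\ldots,E_q$ are independent by Lemma 10.4.1. ∎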


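We shall now find a characterization of the distribution of $V_{1g}$. A statistic
$V_{1g}$ is of the form

(5)  $\dfrac{|B|^b |C|^c}{|B + C|^{b+c}}.$


Let $B_i$ and $C_i$ be the upper left-hand square submatrices of $B$ and $C$,
respectively, of order $i$. Define $b_{(i)}$ and $c_{(i)}$ by

(6)  $B_i = \begin{pmatrix} B_{i-1} & b_{(i)} \\ b_{(i)}' & b_{ii} \end{pmatrix}, \qquad C_i = \begin{pmatrix} C_{i-1} & c_{(i)} \\ c_{(i)}' & c_{ii} \end{pmatrix}, \qquad i = 2,\ldots,p.$

Then (5) is ($B_0 = C_0 = I$, $b_{(1)} = c_{(1)} = 0$)

(7)  $\dfrac{|B|^b |C|^c}{|B + C|^{b+c}} = \prod_{i=1}^{p} \dfrac{|B_i|^b |C_i|^c}{|B_{i-1}|^b |C_{i-1}|^c} \cdot \dfrac{|B_{i-1} + C_{i-1}|^{b+c}}{|B_i + C_i|^{b+c}}$

     $= \prod_{i=1}^{p} \dfrac{(b_{ii} - b_{(i)}'B_{i-1}^{-1}b_{(i)})^b\,(c_{ii} - c_{(i)}'C_{i-1}^{-1}c_{(i)})^c}{[\,b_{ii} + c_{ii} - (b_{(i)} + c_{(i)})'(B_{i-1} + C_{i-1})^{-1}(b_{(i)} + c_{(i)})\,]^{b+c}}$

     $= \prod_{i=1}^{p} \left\{ \dfrac{b_{ii \cdot i-1}^{\,b}\, c_{ii \cdot i-1}^{\,c}}{(b_{ii \cdot i-1} + c_{ii \cdot i-1})^{b+c}} \cdot \dfrac{(b_{ii \cdot i-1} + c_{ii \cdot i-1})^{b+c}}{[\,b_{ii \cdot i-1} + c_{ii \cdot i-1} + b_{(i)}'B_{i-1}^{-1}b_{(i)} + c_{(i)}'C_{i-1}^{-1}c_{(i)} - (b_{(i)} + c_{(i)})'(B_{i-1} + C_{i-1})^{-1}(b_{(i)} + c_{(i)})\,]^{b+c}} \right\},$

where $b_{ii \cdot i-1} = b_{ii} - b_{(i)}'B_{i-1}^{-1}b_{(i)}$ and $c_{ii \cdot i-1} = c_{ii} - c_{(i)}'C_{i-1}^{-1}c_{(i)}$. The second term
for $i = 1$ is defined as 1.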

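Now we want to argue that the ratios on the right-hand side of (7) are
statistically independent when $B$ and $C$ are independently distributed
according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. It follows from Theorem 4.3.3
that for $B_{i-1}$ fixed, $b_{(i)}$ and $b_{ii \cdot i-1}$ are independently distributed according to
$N(B_{i-1}\beta_i, \sigma_{ii \cdot i-1}B_{i-1})$ and $\sigma_{ii \cdot i-1}\chi^2$ with $m - (i - 1)$ degrees of freedom,
respectively. Lemma 10.4.1 implies that the first term (which is a function of
$b_{ii \cdot i-1}/c_{ii \cdot i-1}$) is independent of $b_{ii \cdot i-1} + c_{ii \cdot i-1}$.

We apply the following lemma:


Lemma 10.4.2. For $B_{i-1}$ and $C_{i-1}$ positive definite

(8)  $b_{(i)}'B_{i-1}^{-1}b_{(i)} + c_{(i)}'C_{i-1}^{-1}c_{(i)} - (b_{(i)} + c_{(i)})'(B_{i-1} + C_{i-1})^{-1}(b_{(i)} + c_{(i)})$

     $= (B_{i-1}^{-1}b_{(i)} - C_{i-1}^{-1}c_{(i)})'(B_{i-1}^{-1} + C_{i-1}^{-1})^{-1}(B_{i-1}^{-1}b_{(i)} - C_{i-1}^{-1}c_{(i)}).$


Proof. Use of $(B^{-1} + C^{-1})^{-1} = [C^{-1}(B + C)B^{-1}]^{-1} = B(B + C)^{-1}C$
shows the left-hand side of (8) is (omitting $i$ and $i - 1$)

(9)  $b'B^{-1}(B^{-1} + C^{-1})^{-1}(B^{-1} + C^{-1})b + c'(B^{-1} + C^{-1})(B^{-1} + C^{-1})^{-1}C^{-1}c$
     $\quad - (b + c)'B^{-1}(B^{-1} + C^{-1})^{-1}C^{-1}(b + c)$
     $= b'B^{-1}(B^{-1} + C^{-1})^{-1}B^{-1}b + c'C^{-1}(B^{-1} + C^{-1})^{-1}C^{-1}c$
     $\quad - b'B^{-1}(B^{-1} + C^{-1})^{-1}C^{-1}c - c'C^{-1}(B^{-1} + C^{-1})^{-1}B^{-1}b,$

which is the right-hand side of (8). ∎

The denominator of the $i$th second term in (7) is the numerator plus (8).
The conditional distribution of $B_{i-1}^{-1}b_{(i)} - C_{i-1}^{-1}c_{(i)}$ is normal with mean
$B_{i-1}^{-1}B_{i-1}\beta_i - C_{i-1}^{-1}C_{i-1}\beta_i = 0$ and covariance matrix $\sigma_{ii \cdot i-1}(B_{i-1}^{-1} + C_{i-1}^{-1})$. The
covariance matrix is $\sigma_{ii \cdot i-1}$ times the inverse of the second matrix on the right-hand
side of (8). Thus (8) is distributed as $\sigma_{ii \cdot i-1}\chi^2$ with $i - 1$ degrees of freedom,
independent of $B_{i-1}$, $C_{i-1}$, $b_{ii \cdot i-1}$, and $c_{ii \cdot i-1}$.

Then

(10)  $\dfrac{b_{ii \cdot i-1}^{\,b}\, c_{ii \cdot i-1}^{\,c}}{(b_{ii \cdot i-1} + c_{ii \cdot i-1})^{b+c}} = \left( \dfrac{b_{ii \cdot i-1}}{b_{ii \cdot i-1} + c_{ii \cdot i-1}} \right)^{b} \left( \dfrac{c_{ii \cdot i-1}}{b_{ii \cdot i-1} + c_{ii \cdot i-1}} \right)^{c}$

is distributed as $X_i^b(1 - X_i)^c$, where $X_i$ has the $\beta[\frac{1}{2}(m - i + 1), \frac{1}{2}(n - i + 1)]$
distribution, $i = 1,\ldots,p$. Also

(11)  $\left[ \dfrac{b_{ii \cdot i-1} + c_{ii \cdot i-1}}{b_{ii \cdot i-1} + c_{ii \cdot i-1} + (8)} \right]^{b+c}, \qquad i = 2,\ldots,p,$

is distributed as $Y_i^{b+c}$, where $Y_i$ has the $\beta[\frac{1}{2}(m + n) - i + 1, \frac{1}{2}(i - 1)]$ distribution.
Then (5) is distributed as $\prod_{i=1}^{p} X_i^b(1 - X_i)^c \prod_{i=2}^{p} Y_i^{b+c}$, and the factors
are mutually independent.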


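Theorem 10.4.2.

(12)  $V_1 = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} X_{ig}^{\frac{1}{2}(n_1 + \cdots + n_{g-1})} (1 - X_{ig})^{\frac{1}{2}n_g} \prod_{i=2}^{p} Y_{ig}^{\frac{1}{2}(n_1 + \cdots + n_g)} \right\},$

where the $X$'s and $Y$'s are independent, $X_{ig}$ has the $\beta[\frac{1}{2}(n_1 + \cdots + n_{g-1} - i + 1),
\frac{1}{2}(n_g - i + 1)]$ distribution, and $Y_{ig}$ has the $\beta[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1, \frac{1}{2}(i - 1)]$
distribution.


Proof. The factors $V_{12},\ldots,V_{1q}$ are independent by Theorem 10.4.1. Each
term $V_{1g}$ is decomposed according to (7), and the factors are independent. ∎


The factors of $V_1$ can be interpreted as test criteria for subhypotheses.
The term depending on $X_{i2}$ is the criterion for testing the hypothesis that
$\sigma_{ii \cdot i-1}^{(1)} = \sigma_{ii \cdot i-1}^{(2)}$, and the term depending on $Y_{i2}$ is the criterion for testing
$\sigma_{(i)}^{(1)} = \sigma_{(i)}^{(2)}$ given $\sigma_{ii \cdot i-1}^{(1)} = \sigma_{ii \cdot i-1}^{(2)}$ and $\Sigma_{i-1,1} = \Sigma_{i-1,2}$. The terms depending
on $X_{ig}$ and $Y_{ig}$ similarly furnish criteria for testing $\Sigma_1 = \Sigma_g$ given $\Sigma_1 = \cdots =
\Sigma_{g-1}$.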

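Now consider the likelihood ratio criterion $\lambda$ given by (7) of Section 10.3
for testing the hypothesis $\mu^{(1)} = \cdots = \mu^{(q)}$ and $\Sigma_1 = \cdots = \Sigma_q$. It is equivalent
to the criterion

(13)  $W = \dfrac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{|A_1 + \cdots + A_q|^{\frac{1}{2}(N_1 + \cdots + N_q)}} \cdot \dfrac{|A_1 + \cdots + A_q|^{\frac{1}{2}N}}{|A_1 + \cdots + A_q + \sum_{g=1}^{q} N_g(\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'|^{\frac{1}{2}N}}.$


The two factors of (13) are independent because the first factor is independent
of $A_1 + \cdots + A_q$ (by Lemma 10.4.1 and the proof of Theorem 10.4.1)
and of $\bar{x}^{(1)},\ldots,\bar{x}^{(q)}$.

Theorem 10.4.3.

(14)  $W = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} X_{ig}^{\frac{1}{2}(N_1 + \cdots + N_{g-1})} (1 - X_{ig})^{\frac{1}{2}N_g} \prod_{i=2}^{p} Y_{ig}^{\frac{1}{2}(N_1 + \cdots + N_g)} \right\} \prod_{i=1}^{p} Z_i^{\frac{1}{2}N},$

where the $X$'s, $Y$'s, and $Z$'s are independent, $X_{ig}$ has the $\beta[\frac{1}{2}(n_1 + \cdots + n_{g-1} - i + 1),
\frac{1}{2}(n_g - i + 1)]$ distribution, $Y_{ig}$ has the $\beta[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1, \frac{1}{2}(i - 1)]$
distribution, and $Z_i$ has the $\beta[\frac{1}{2}(n + 1 - i), \frac{1}{2}(q - 1)]$ distribution.


Proof. The characterization of the first factor in (13) corresponds to that
of $V_1$ with the exponents of $X_{ig}$ and $1 - X_{ig}$ modified by replacing $n_g$ by $N_g$.
The second term is $U_{p,q-1,n}^{\frac{1}{2}N}$, and its characterization follows from Theorem
8.4.1. ∎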


10.4.2. Moments of the Distributions 


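We now find the moments of $V_1$ and of $W$. Since $0 \le V_1 \le 1$ and $0 \le W \le 1$,
the moments determine the distributions uniquely. The $h$th moment of $V_1$
we find from the characterization of the distribution in Theorem 10.4.2:

(15)  $\mathcal{E}V_1^h = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} \mathcal{E}X_{ig}^{\frac{1}{2}(n_1 + \cdots + n_{g-1})h}(1 - X_{ig})^{\frac{1}{2}n_gh} \prod_{i=2}^{p} \mathcal{E}Y_{ig}^{\frac{1}{2}(n_1 + \cdots + n_g)h} \right\}$

      $= \prod_{g=2}^{q} \prod_{i=1}^{p} \dfrac{\Gamma[\frac{1}{2}(n_1 + \cdots + n_{g-1})(1 + h) - \frac{1}{2}(i - 1)]\;\Gamma[\frac{1}{2}n_g(1 + h) - \frac{1}{2}(i - 1)]\;\Gamma[\frac{1}{2}(n_1 + \cdots + n_g) - \frac{1}{2}(i - 1)]}{\Gamma[\frac{1}{2}(n_1 + \cdots + n_{g-1}) - \frac{1}{2}(i - 1)]\;\Gamma[\frac{1}{2}n_g - \frac{1}{2}(i - 1)]\;\Gamma[\frac{1}{2}(n_1 + \cdots + n_g)(1 + h) - \frac{1}{2}(i - 1)]}$

      $= \prod_{i=1}^{p} \left\{ \dfrac{\Gamma[\frac{1}{2}(n + 1 - i)]}{\Gamma[\frac{1}{2}(n + hn + 1 - i)]} \prod_{g=1}^{q} \dfrac{\Gamma[\frac{1}{2}(n_g + hn_g + 1 - i)]}{\Gamma[\frac{1}{2}(n_g + 1 - i)]} \right\}$

      $= \dfrac{\Gamma_p(\frac{1}{2}n)}{\Gamma_p[\frac{1}{2}(n + hn)]} \prod_{g=1}^{q} \dfrac{\Gamma_p[\frac{1}{2}(n_g + hn_g)]}{\Gamma_p(\frac{1}{2}n_g)}.$

The $h$th moment of $W$ can be found from its representation in Theorem
10.4.3. We have

(16)  $\mathcal{E}W^h = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} \mathcal{E}X_{ig}^{\frac{1}{2}(N_1 + \cdots + N_{g-1})h}(1 - X_{ig})^{\frac{1}{2}N_gh} \prod_{i=2}^{p} \mathcal{E}Y_{ig}^{\frac{1}{2}(N_1 + \cdots + N_g)h} \right\} \prod_{i=1}^{p} \mathcal{E}Z_i^{\frac{1}{2}Nh}$

      $= \prod_{i=1}^{p} \left\{ \dfrac{\Gamma[\frac{1}{2}(n + 1 - i)]}{\Gamma[\frac{1}{2}(n + hN + 1 - i)]} \prod_{g=1}^{q} \dfrac{\Gamma[\frac{1}{2}(n_g + hN_g + 1 - i)]}{\Gamma[\frac{1}{2}(n_g + 1 - i)]} \right\} \prod_{i=1}^{p} \dfrac{\Gamma[\frac{1}{2}(N - i)]\;\Gamma[\frac{1}{2}(n + hN + 1 - i)]}{\Gamma[\frac{1}{2}(n + 1 - i)]\;\Gamma[\frac{1}{2}(N + hN - i)]}$

      $= \prod_{i=1}^{p} \left\{ \dfrac{\Gamma[\frac{1}{2}(N - i)]}{\Gamma[\frac{1}{2}(N + hN - i)]} \prod_{g=1}^{q} \dfrac{\Gamma[\frac{1}{2}(N_g + hN_g - i)]}{\Gamma[\frac{1}{2}(N_g - i)]} \right\}$

      $= \dfrac{\Gamma_p[\frac{1}{2}(N - 1)]}{\Gamma_p[\frac{1}{2}(N + hN - 1)]} \prod_{g=1}^{q} \dfrac{\Gamma_p[\frac{1}{2}(n_g + hN_g)]}{\Gamma_p(\frac{1}{2}n_g)}.$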


We summarize in the following theorem: 


Theorem 10.4.4. Let $V_1$ be the criterion defined by (10) of Section 10.2 for
testing the hypothesis $H_1: \Sigma_1 = \cdots = \Sigma_q$, where $A_g$ is $n_g$ times the sample
covariance matrix and $n_g + 1$ is the size of the sample from the $g$th population; let
$W$ be the criterion defined by (13) for testing the hypothesis $H: \mu_1 = \cdots = \mu_q$
and $H_1$, where $B = A + \sum_g N_g(\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$. The $h$th moment of $V_1$ when
$H_1$ is true is given by (15). The $h$th moment of $W$, the criterion for testing $H$, is
given by (16).


This theorem was first proved by Wilks (1932). See Problem 10.5 for an
alternative approach.
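
If $p$ is even, say $p = 2r$, we can use the duplication formula for the gamma
function [$\Gamma(\alpha + \frac{1}{2})\Gamma(\alpha + 1) = \sqrt{\pi}\,\Gamma(2\alpha + 1)2^{-2\alpha}$]. Then


(17)  $\mathcal{E}V_1^h = \prod_{j=1}^{r} \left\{ \left[ \prod_{g=1}^{q} \dfrac{\Gamma(n_g + hn_g + 1 - 2j)}{\Gamma(n_g + 1 - 2j)} \right] \dfrac{\Gamma(n + 1 - 2j)}{\Gamma(n + hn + 1 - 2j)} \right\}$

and

(18)  $\mathcal{E}W^h = \prod_{j=1}^{r} \left\{ \left[ \prod_{g=1}^{q} \dfrac{\Gamma(n_g + hN_g + 1 - 2j)}{\Gamma(n_g + 1 - 2j)} \right] \dfrac{\Gamma(N - 2j)}{\Gamma(N + hN - 2j)} \right\}.$


In principle the distributions of the factors can be integrated to obtain the
distributions of $V_1$ and $W$. In Section 10.6 we consider $V_1$ when $p = 2$, $q = 2$
(the case of $p = 1$, $q = 2$ being a function of an $F$-statistic). In other cases,
the integrals become unmanageable. To find probabilities we use the asymptotic
expansion given in the next section. Box (1949) has given some other
approximate distributions.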


10.4.3. Step-down Tests 


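The characterizations of the distributions of the criteria in terms of independent
factors suggest testing the hypotheses $H_1$ and $H$ by testing component
hypotheses sequentially. First, we consider testing $H_1: \Sigma_1 = \Sigma_2$ for $q = 2$.
Let

(19)  $X_{(i)}^{(g)} = \begin{pmatrix} X_{(i-1)}^{(g)} \\ X_i^{(g)} \end{pmatrix}, \qquad \mu_{(i)}^{(g)} = \begin{pmatrix} \mu_{(i-1)}^{(g)} \\ \mu_i^{(g)} \end{pmatrix}, \qquad \Sigma_i^{(g)} = \begin{pmatrix} \Sigma_{i-1}^{(g)} & \sigma_{(i)}^{(g)} \\ \sigma_{(i)}^{(g)\prime} & \sigma_{ii}^{(g)} \end{pmatrix}.$


The conditional distribution of $X_i^{(g)}$ given $X_{(i-1)}^{(g)} = x_{(i-1)}^{(g)}$ is

(20)  $N[\,\mu_i^{(g)} + \sigma_{(i)}^{(g)\prime}\Sigma_{i-1}^{(g)-1}(x_{(i-1)}^{(g)} - \mu_{(i-1)}^{(g)}),\; \sigma_{ii \cdot i-1}^{(g)}\,],$

where $\sigma_{ii \cdot i-1}^{(g)} = \sigma_{ii}^{(g)} - \sigma_{(i)}^{(g)\prime}\Sigma_{i-1}^{(g)-1}\sigma_{(i)}^{(g)}$. It is assumed that the components of $X$
have been numbered in descending order of importance. At the $i$th step the
component hypothesis $\sigma_{ii \cdot i-1}^{(1)} = \sigma_{ii \cdot i-1}^{(2)}$ is tested at significance level $\varepsilon_i$ by
means of an $F$-test based on $s_{ii \cdot i-1}^{(1)}/s_{ii \cdot i-1}^{(2)}$; $S_1$ and $S_2$ are partitioned like $\Sigma^{(1)}$
and $\Sigma^{(2)}$. If that hypothesis is accepted, then the hypothesis $\sigma_{(i)}^{(1)} = \sigma_{(i)}^{(2)}$ (or
$\Sigma_{i-1}^{(1)-1}\sigma_{(i)}^{(1)} = \Sigma_{i-1}^{(2)-1}\sigma_{(i)}^{(2)}$) is tested at significance level $\delta_i$ on the assumption
that $\Sigma_{i-1}^{(1)} = \Sigma_{i-1}^{(2)}$ (a hypothesis previously accepted). The criterion is

(21)  $\dfrac{(S_{i-1}^{(1)-1}s_{(i)}^{(1)} - S_{i-1}^{(2)-1}s_{(i)}^{(2)})'\,(S_{i-1}^{(1)-1} + S_{i-1}^{(2)-1})^{-1}\,(S_{i-1}^{(1)-1}s_{(i)}^{(1)} - S_{i-1}^{(2)-1}s_{(i)}^{(2)})}{(i - 1)\,s_{ii \cdot i-1}},$

where $(n_1 + n_2 - 2i + 2)s_{ii \cdot i-1} = (n_1 - i + 1)s_{ii \cdot i-1}^{(1)} + (n_2 - i + 1)s_{ii \cdot i-1}^{(2)}$. Under
the null hypothesis (21) has the $F$-distribution with $i - 1$ and $n_1 + n_2 - 2i + 2$
degrees of freedom. If this hypothesis is accepted, the $(i + 1)$st step is taken.
The overall hypothesis $\Sigma_1 = \Sigma_2$ is accepted if the $2p - 1$ component
hypotheses are accepted. (At the first step, $\sigma_{(1)}$ is vacuous.) The overall
significance level is

(22)  $1 - \prod_{i=1}^{p}(1 - \varepsilon_i)\prod_{i=2}^{p}(1 - \delta_i).$


If any component null hypothesis is rejected, the overall hypothesis is
rejected.

If $q > 2$, the null hypothesis $H_1: \Sigma_1 = \cdots = \Sigma_q$ is broken down into a
sequence of hypotheses $[1/(g - 1)](\Sigma_1 + \cdots + \Sigma_{g-1}) = \Sigma_g$ and tested sequentially.
Each such matrix hypothesis is tested as $\Sigma_1 = \Sigma_2$ with $S_2$ replaced by
$S_g$ and $S_1$ replaced by $[1/(n_1 + \cdots + n_{g-1})](A_1 + \cdots + A_{g-1})$.

In the case of the hypothesis $H$, consider first $q = 2$, $\Sigma_1 = \Sigma_2$, and
$\mu^{(1)} = \mu^{(2)}$. One can test $\Sigma_1 = \Sigma_2$. The steps for testing $\mu^{(1)} = \mu^{(2)}$ consist of
$t$-tests for $\mu_i^{(1)} = \mu_i^{(2)}$ based on the conditional distribution of $X_i^{(1)}$ and $X_i^{(2)}$
given $x_{(i-1)}^{(1)}$ and $x_{(i-1)}^{(2)}$. Alternatively one can test in sequence the equality of
the conditional distributions of $X_i^{(1)}$ and $X_i^{(2)}$ given $x_{(i-1)}^{(1)}$ and $x_{(i-1)}^{(2)}$.

For $q > 2$, the hypothesis $\Sigma_1 = \cdots = \Sigma_q$ can be tested, and then $\mu_1 = \cdots
= \mu_q$. Alternatively, one can test $[1/(g - 1)](\Sigma_1 + \cdots + \Sigma_{g-1}) = \Sigma_g$ and
$[1/(g - 1)](\mu^{(1)} + \cdots + \mu^{(g-1)}) = \mu^{(g)}$.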


10.5. ASYMPTOTIC EXPANSIONS OF THE DISTRIBUTIONS 
OF THE CRITERIA 


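Again we make use of Theorem 8.5.1 to obtain asymptotic expansions of the
distributions of $V_1$ and of $\lambda$. We assume that $n_g = k_gn$, where $\sum_{g=1}^{q} k_g = 1$.
The asymptotic expansion is in terms of $n$ increasing with $k_1,\ldots,k_q$ fixed.
(We could assume only $\lim n_g/n = k_g > 0$.)

The $h$th moment of

(1)  $\lambda_1^* = V_1 \cdot \dfrac{n^{\frac{1}{2}pn}}{\prod_{g=1}^{q} n_g^{\frac{1}{2}pn_g}} = V_1 \left[ \prod_{g=1}^{q} k_g^{-k_g} \right]^{\frac{1}{2}pn}$

is

(2)  $\mathcal{E}\lambda_1^{*h} = K \left[ \dfrac{\prod_{j=1}^{p} (\frac{1}{2}n)^{\frac{1}{2}n}}{\prod_{g=1}^{q}\prod_{j=1}^{p} (\frac{1}{2}n_g)^{\frac{1}{2}n_g}} \right]^{h} \dfrac{\prod_{g=1}^{q}\prod_{j=1}^{p} \Gamma[\frac{1}{2}n_g(1 + h) + \frac{1}{2}(1 - j)]}{\prod_{j=1}^{p} \Gamma[\frac{1}{2}n(1 + h) + \frac{1}{2}(1 - j)]}.$

This is of the form of (1) of Section 8.5 with

(3)  $b = p, \quad y_j = \frac{1}{2}n, \quad \eta_j = \frac{1}{2}(1 - j), \quad j = 1,\ldots,p,$
     $a = pq, \quad x_k = \frac{1}{2}n_g, \quad k = (g - 1)p + 1,\ldots,gp, \quad g = 1,\ldots,q,$
     $\xi_k = \frac{1}{2}(1 - i), \quad k = i, p + i,\ldots,(q - 1)p + i, \quad i = 1,\ldots,p.$

Then

(4)  $f = -2\left[ \sum_k \xi_k - \sum_j \eta_j - \frac{1}{2}(a - b) \right]$
     $= -\left[ q\sum_{i=1}^{p}(1 - i) - \sum_{j=1}^{p}(1 - j) - (qp - p) \right]$
     $= -\left[ -\frac{1}{2}qp(p - 1) + \frac{1}{2}p(p - 1) - (q - 1)p \right]$
     $= \frac{1}{2}(q - 1)p(p + 1),$

$\varepsilon_j = \frac{1}{2}(1 - \rho)n$, $j = 1,\ldots,p$, and $\beta_k = \frac{1}{2}(1 - \rho)n_g = \frac{1}{2}(1 - \rho)k_gn$, $k = (g - 1)p
+ 1,\ldots,gp$.

In order to make the second term in the expansion vanish, we take $\rho$ as

(5)  $\rho = 1 - \left( \sum_{g=1}^{q} \dfrac{1}{n_g} - \dfrac{1}{n} \right) \dfrac{2p^2 + 3p - 1}{6(p + 1)(q - 1)}.$

Then

(6)  $\omega_2 = \dfrac{p(p + 1)\left[ (p - 1)(p + 2)\left( \sum_{g=1}^{q} \dfrac{1}{n_g^2} - \dfrac{1}{n^2} \right) - 6(q - 1)(1 - \rho)^2 \right]}{48\rho^2}.$

Thus

(7)  $\Pr\{-2\rho \log \lambda_1^* \le z\} = \Pr\{\chi_f^2 \le z\} + \omega_2\left[ \Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\} \right] + O(n^{-3}).$

Let $\lambda = W N^{\frac{1}{2}pN} / \prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}$. The $h$th moment is

(8)  $\mathcal{E}\lambda^h = K \left[ \dfrac{\prod_{j=1}^{p} (\frac{1}{2}N)^{\frac{1}{2}N}}{\prod_{g=1}^{q}\prod_{j=1}^{p} (\frac{1}{2}N_g)^{\frac{1}{2}N_g}} \right]^{h} \dfrac{\prod_{g=1}^{q}\prod_{j=1}^{p} \Gamma[\frac{1}{2}N_g(1 + h) - \frac{1}{2}j]}{\prod_{j=1}^{p} \Gamma[\frac{1}{2}N(1 + h) - \frac{1}{2}j]}.$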
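
This is the form (1) of Section 8.5 with

(9)  $b = p, \quad y_j = \frac{1}{2}N, \quad \eta_j = -\frac{1}{2}j, \quad j = 1,\ldots,p, \quad N = \sum_{g=1}^{q} N_g,$
     $a = pq, \quad x_k = \frac{1}{2}N_g, \quad k = (g - 1)p + 1,\ldots,gp, \quad g = 1,\ldots,q,$
     $\xi_k = -\frac{1}{2}i, \quad k = i, p + i,\ldots,(q - 1)p + i, \quad i = 1,\ldots,p.$

The basic number of degrees of freedom is $f = \frac{1}{2}p(p + 3)(q - 1)$. We use (11)
of Section 8.5 with $\beta_k = (1 - \rho)x_k$ and $\varepsilon_j = (1 - \rho)y_j$. To make $\omega_1 = 0$, we
take

(10)  $\rho = 1 - \left( \sum_{g=1}^{q} \dfrac{1}{N_g} - \dfrac{1}{N} \right) \dfrac{2p^2 + 9p + 11}{6(q - 1)(p + 3)}.$

Then

(11)  $\omega_2 = \dfrac{p(p + 3)}{48\rho^2} \left[ \left( \sum_{g=1}^{q} \dfrac{1}{N_g^2} - \dfrac{1}{N^2} \right)(p + 1)(p + 2) - 6(1 - \rho)^2(q - 1) \right].$

The asymptotic expansion of the distribution of $-2\rho \log \lambda$ is

(12)  $\Pr\{-2\rho \log \lambda \le z\} = \Pr\{\chi_f^2 \le z\} + \omega_2\left[ \Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\} \right] + O(n^{-3}).$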


Box (1949) considered the case of $\lambda_1^*$ in considerable detail. In addition to
this expansion he considered the use of (13) of Section 8.6. He also gave an
$F$-approximation.

As an example, we use one given by E. S. Pearson and Wilks (1933). The
measurements are made on tensile strength ($X_1$) and hardness ($X_2$) of
aluminum die castings. There are 12 observations in each of five samples.
The observed sums of squares and cross-products in the five samples are

(13)  $A_1 = \begin{pmatrix} 78.948 & 214.18 \\ 214.18 & 1247.18 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 223.695 & 657.62 \\ 657.62 & 2519.31 \end{pmatrix},$

      $A_3 = \begin{pmatrix} 57.448 & 190.63 \\ 190.63 & 1241.78 \end{pmatrix}, \qquad A_4 = \begin{pmatrix} 187.618 & 375.91 \\ 375.91 & 1473.44 \end{pmatrix},$

      $A_5 = \begin{pmatrix} 88.456 & 259.18 \\ 259.18 & 1171.73 \end{pmatrix},$

and the sum of these is

(14)  $A = \begin{pmatrix} 636.165 & 1697.52 \\ 1697.52 & 7653.44 \end{pmatrix}.$


The $-\log \lambda_1^*$ is 5.399. To use the asymptotic expansion we find $\rho = 152/165
= 0.9212$ and $\omega_2 = 0.0022$. Since $\omega_2$ is small, we can consider $-2\rho \log \lambda_1^*$ as
$\chi^2$ with 12 degrees of freedom. Our observed criterion, therefore, is clearly
not significant.
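
The arithmetic of this example can be retraced with a short script. This is a sketch using only the Python standard library; the variable names are illustrative. The matrix entries are those of (13), with $n_g = 11$, $p = 2$, $q = 5$, and $\rho$ and $\omega_2$ computed from (5) and (6). The 5% point 21.026 of $\chi^2$ with 12 degrees of freedom is a standard table value quoted for comparison, not a value from the text.

```python
import math

# Sums of squares and cross-products A_1, ..., A_5 from (13).
A_list = [
    ((78.948, 214.18), (214.18, 1247.18)),
    ((223.695, 657.62), (657.62, 2519.31)),
    ((57.448, 190.63), (190.63, 1241.78)),
    ((187.618, 375.91), (375.91, 1473.44)),
    ((88.456, 259.18), (259.18, 1171.73)),
]
p, q = 2, 5
n_list = [11] * q            # n_g = N_g - 1 with N_g = 12
n = sum(n_list)

def det2(M):
    """Determinant of a 2 x 2 matrix given as nested tuples."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Pooled matrix A = A_1 + ... + A_5, as in (14).
A = tuple(tuple(sum(M[r][c] for M in A_list) for c in range(2)) for r in range(2))

# log lambda_1* = sum_g (n_g/2) log|S_g| - (n/2) log|S|, with S_g = A_g/n_g, S = A/n.
log_lam = sum(0.5 * n_g * math.log(det2(A_g) / n_g**p)
              for A_g, n_g in zip(A_list, n_list))
log_lam -= 0.5 * n * math.log(det2(A) / n**p)
neg_log_lam = -log_lam

# rho from (5), omega_2 from (6), degrees of freedom f from (4).
s1 = sum(1.0 / n_g for n_g in n_list) - 1.0 / n
s2 = sum(1.0 / n_g**2 for n_g in n_list) - 1.0 / n**2
rho = 1.0 - s1 * (2 * p * p + 3 * p - 1) / (6.0 * (p + 1) * (q - 1))
omega2 = p * (p + 1) * ((p - 1) * (p + 2) * s2
                        - 6 * (q - 1) * (1 - rho) ** 2) / (48.0 * rho**2)
f = (q - 1) * p * (p + 1) // 2

statistic = 2.0 * rho * neg_log_lam   # -2 rho log lambda_1*
CHI2_12_UPPER_5PCT = 21.026           # standard table value, quoted for comparison
significant = statistic > CHI2_12_UPPER_5PCT
```

Running the script gives $-\log \lambda_1^*$ close to 5.399, $\rho = 152/165$, and a statistic of roughly 9.9, well below the 5% point.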

Table B.5 [due to Korin (1969)] gives 5% significance points for $-2 \log \lambda_1^*$
for $N_1 = \cdots = N_q$ for various $q$, small values of $N_g$, and $p = 2(1)6$.

The limiting distribution of the criterion (19) of Section 10.2 is also $\chi_f^2$. An
asymptotic expansion of the distribution was given by Nagao (1973b) to terms
of order $1/n$ involving $\chi^2$-distributions with $f$, $f + 2$, $f + 4$, and $f + 6$
degrees of freedom.


10.6. THE CASE OF TWO POPULATIONS 


10.6.1. Invariant Tests 


When $q = 2$, the null hypothesis $H_1$ is $\Sigma_1 = \Sigma_2$. It is invariant with respect to
transformations


(1)  $x^{*(1)} = Cx^{(1)} + \nu^{(1)}, \qquad x^{*(2)} = Cx^{(2)} + \nu^{(2)},$


where $C$ is nonsingular. The maximal invariant of the parameters under the
transformation of locations ($C = I$) is the pair of covariance matrices $\Sigma_1, \Sigma_2$,
and the maximal invariant of the sufficient statistics $\bar{x}^{(1)}, S_1, \bar{x}^{(2)}, S_2$ is the
pair of matrices $S_1, S_2$ (or equivalently $A_1, A_2$). The transformation (1)
induces the transformations $\Sigma_1^* = C\Sigma_1C'$, $\Sigma_2^* = C\Sigma_2C'$, $S_1^* = CS_1C'$, and
$S_2^* = CS_2C'$. The roots $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ of

(2)  $|\Sigma_1 - \lambda\Sigma_2| = 0$

are invariant under these transformations since

(3)  $|\Sigma_1^* - \lambda\Sigma_2^*| = |C\Sigma_1C' - \lambda C\Sigma_2C'| = |CC'| \cdot |\Sigma_1 - \lambda\Sigma_2|.$


Moreover, the roots are the only invariants because there exists a nonsingular
matrix $C$ such that

(4)  $C\Sigma_1C' = \Lambda, \qquad C\Sigma_2C' = I,$


where $\Lambda$ is the diagonal matrix with $\lambda_i$ as the $i$th diagonal element,
$i = 1,\ldots,p$. (See Theorem A.2.2 of the Appendix.) Similarly, the maximal
invariants of $S_1$ and $S_2$ are the roots $l_1 \ge l_2 \ge \cdots \ge l_p$ of

(5)  $|S_1 - lS_2| = 0.$

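Theorem 10.6.1. The maximal invariant of the parameters of $N(\mu^{(1)}, \Sigma_1)$
and $N(\mu^{(2)}, \Sigma_2)$ under the transformation (1) is the set of roots $\lambda_1 \ge \cdots \ge \lambda_p$ of
(2). The maximal invariant of the sufficient statistics $\bar{x}^{(1)}, S_1, \bar{x}^{(2)}, S_2$ is the set of
roots $l_1 \ge \cdots \ge l_p$ of (5).


Any invariant test criterion can be expressed in terms of the roots
$l_1,\ldots,l_p$. The criterion $V_1$ is $n_1^{\frac{1}{2}pn_1}n_2^{\frac{1}{2}pn_2}$ times

(6)  $\dfrac{|S_1|^{\frac{1}{2}n_1}|S_2|^{\frac{1}{2}n_2}}{|n_1S_1 + n_2S_2|^{\frac{1}{2}n}} = \dfrac{|L|^{\frac{1}{2}n_1}}{|n_1L + n_2I|^{\frac{1}{2}n}} = \prod_{i=1}^{p} \dfrac{l_i^{\frac{1}{2}n_1}}{(n_1l_i + n_2)^{\frac{1}{2}n}},$


where $L$ is the diagonal matrix with $l_i$ as the $i$th diagonal element. The null
hypothesis is rejected if the smaller roots are too small or if the larger roots
are too large, or both.


The null hypothesis is that $\lambda_1 = \cdots = \lambda_p = 1$. Any useful invariant test of
the null hypothesis has a rejection region in the space of $l_1,\ldots,l_p$ that
includes the points that in some sense are far from $l_1 = \cdots = l_p = 1$. The
power of an invariant test depends on the parameters through the roots
$\lambda_1,\ldots,\lambda_p$.


The criterion (19) of Section 10.2 is (with $nS = n_1S_1 + n_2S_2$)

(7)  $\frac{1}{2}n_1\operatorname{tr}[(S_1 - S)S^{-1}]^2 + \frac{1}{2}n_2\operatorname{tr}[(S_2 - S)S^{-1}]^2$

     $= \frac{1}{2}n_1\operatorname{tr}[C(S_1 - S)C'(CSC')^{-1}]^2 + \frac{1}{2}n_2\operatorname{tr}[C(S_2 - S)C'(CSC')^{-1}]^2$

     $= \frac{1}{2}n_1\operatorname{tr}\left\{ \left[ L - \left( \frac{n_1}{n}L + \frac{n_2}{n}I \right) \right] \left( \frac{n_1}{n}L + \frac{n_2}{n}I \right)^{-1} \right\}^2 + \frac{1}{2}n_2\operatorname{tr}\left\{ \left[ I - \left( \frac{n_1}{n}L + \frac{n_2}{n}I \right) \right] \left( \frac{n_1}{n}L + \frac{n_2}{n}I \right)^{-1} \right\}^2$

     $= \frac{1}{2}nn_1n_2 \sum_{i=1}^{p} \dfrac{(l_i - 1)^2}{(n_1l_i + n_2)^2}.$


This criterion is a measure of how close $l_1,\ldots,l_p$ are to 1; the hypothesis is
rejected if the measure is too large. Under the null hypothesis, (7) has the
$\chi^2$-distribution with $f = \frac{1}{2}p(p + 1)$ degrees of freedom as $n_1 \to \infty$, $n_2 \to \infty$,
and $n_1/n_2$ approaches a positive constant. Nagao (1973b) gives an asymptotic
expansion of this distribution to terms of order $1/n$.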
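
The reduction in (7) to a function of the roots can be verified numerically. The sketch below assumes NumPy; the function names are hypothetical. The roots of $|S_1 - lS_2| = 0$ are obtained as the eigenvalues of $S_2^{-1}S_1$, which are real and positive when both matrices are positive definite.

```python
import numpy as np

def roots_two_samples(S1, S2):
    """Roots l_1 >= ... >= l_p of |S1 - l*S2| = 0, via eigenvalues of S2^{-1} S1."""
    eig = np.linalg.eigvals(np.linalg.solve(S2, S1))
    return np.sort(eig.real)[::-1]

def criterion_from_roots(l, n1, n2):
    """Last line of (7): (1/2) n n1 n2 * sum_i (l_i - 1)^2 / (n1 l_i + n2)^2."""
    n = n1 + n2
    return 0.5 * n * n1 * n2 * np.sum((l - 1.0) ** 2 / (n1 * l + n2) ** 2)

def criterion_from_traces(S1, S2, n1, n2):
    """First line of (7), with n S = n1 S1 + n2 S2."""
    n = n1 + n2
    S = (n1 * S1 + n2 * S2) / n
    S_inv = np.linalg.inv(S)
    M1 = (S1 - S) @ S_inv
    M2 = (S2 - S) @ S_inv
    return 0.5 * n1 * np.trace(M1 @ M1) + 0.5 * n2 * np.trace(M2 @ M2)
```

The two criterion functions should return the same value for any pair of positive definite sample covariance matrices.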

Roy (1953) suggested a test based on the largest and smallest roots, $l_1$ and
$l_p$. The procedure is to reject the null hypothesis if $l_1 > k_1$ or if $l_p < k_p$,
where $k_1$ and $k_p$ are chosen so that the probability of rejection when $\Lambda = I$
is the desired significance level. Roy (1957) proposed determining $k_1$ and $k_p$
so that the test is locally unbiased, that is, that the power functions have a
relative minimum at $\Lambda = I$. Since it is hard to determine $k_1$ and $k_p$ on this
basis, other proposals have been made. The limit $k_1$ can be determined so
that $\Pr\{l_1 > k_1 \mid H_1\}$ is one-half the significance level, or $\Pr\{l_p < k_p \mid H_1\}$ is
one-half of the significance level, or $k_1 + k_p = 2$, or $k_1k_p = 1$. In principle $k_1$
and $k_p$ can be determined from the distribution of the roots, given in Section
13.2. Schuurmann, Waikar, and Krishnaiah (1975) and Chu and Pillai (1979)
give some exact values of $k_1$ and $k_p$ for small values of $p$. Chu and Pillai
(1979) also make some power comparisons of several test procedures.

In the case of $p = 1$ the only invariant of the sufficient statistics is $S_1/S_2$,
which is the usual $F$-statistic with $n_1$ and $n_2$ degrees of freedom. The
criterion $V_1$ is $(A_1/A_2)^{\frac{1}{2}n_1}[1 + (A_1/A_2)]^{-\frac{1}{2}n}$; the critical region $V_1$ less than a
constant is equivalent to a two-tailed critical region for the $F$-statistic. The
quantity $n(B - A)/A$ has an independent $F$-distribution with 1 and $n$ degrees
of freedom. (See Section 10.3.)

In the case of $p = 2$, the $h$th moment of $V_1$ is, from (15) of Section 10.4,

(8)  $\mathcal{E}V_1^h = \dfrac{\Gamma(n_1 + hn_1 - 1)\,\Gamma(n_2 + hn_2 - 1)\,\Gamma(n - 1)}{\Gamma(n_1 - 1)\,\Gamma(n_2 - 1)\,\Gamma(n + hn - 1)} = \mathcal{E}\left[ X_1^{n_1h}(1 - X_1)^{n_2h}X_2^{(n_1 + n_2)h} \right],$

where $X_1$ and $X_2$ are independently distributed according to $\beta(x \mid n_1 - 1,
n_2 - 1)$ and $\beta(x \mid n_1 + n_2 - 2, 1)$, respectively. Then $\Pr\{V_1 \le v\}$ can be found by
integration. (See Problems 10.8 and 10.9.)

Anderson (1965a) has shown that a confidence interval for $a'\Sigma_1a/a'\Sigma_2a$
for all $a$ with confidence coefficient $\varepsilon$ is given by $(l_p/U, l_1/L)$, where
$\Pr\{(n_2 - p + 1)L \le n_2F_{n_1, n_2-p+1}\} \Pr\{(n_1 - p + 1)F_{n_1-p+1, n_2} \le n_1U\} = 1 - \varepsilon$.


10.6.2. Components of Variance 


In Section 8.8 we considered what is equivalent to the one-way analysis of
variance with fixed effects. We can write the model in the balanced case
($N_1 = N_2 = \cdots = N_q = M$) as

(9)  $X_\alpha^{(g)} = \mu^{(g)} + U_\alpha^{(g)} = \mu + \nu_g + U_\alpha^{(g)}, \qquad \alpha = 1,\ldots,M, \quad g = 1,\ldots,q,$

where $\mathcal{E}U_\alpha^{(g)} = 0$ and $\mathcal{E}U_\alpha^{(g)}U_\alpha^{(g)\prime} = \Sigma$, $\nu_g = \mu^{(g)} - \mu$, and $\mu = (1/q)\sum_{g=1}^{q} \mu^{(g)}$
($\sum_{g=1}^{q} \nu_g = 0$). The null hypothesis of no effect is $\nu_1 = \cdots = \nu_q = 0$. Let
$\bar{x}^{(g)} = (1/M)\sum_{\alpha} x_\alpha^{(g)}$ and $\bar{x} = (1/q)\sum_{g} \bar{x}^{(g)}$. The analysis of variance table
is

Source    Sum of Squares                                                                                                   Degrees of Freedom
Effect    $H = M\sum_{g=1}^{q} (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$                                        $q - 1$
Error     $G = \sum_{g=1}^{q}\sum_{\alpha=1}^{M} (x_\alpha^{(g)} - \bar{x}^{(g)})(x_\alpha^{(g)} - \bar{x}^{(g)})'$         $q(M - 1)$
Total     $\sum_{g=1}^{q}\sum_{\alpha=1}^{M} (x_\alpha^{(g)} - \bar{x})(x_\alpha^{(g)} - \bar{x})'$                         $qM - 1$


Invariant tests of the null hypothesis of no effect are based on the roots of
$|H - mG| = 0$ or of $|S_h - lS_e| = 0$, where $S_h = [1/(q - 1)]H$ and $S_e =
[1/q(M - 1)]G$. The null hypothesis is rejected if one or more of the roots is
too large. The error matrix $G$ has the distribution $W(\Sigma, q(M - 1))$. The
effects matrix $H$ has the distribution $W(\Sigma, q - 1)$ when the null hypothesis is
true and has the noncentral Wishart distribution when the null hypothesis is
not true; its expected value is

(10)  $\mathcal{E}H = (q - 1)\Sigma + M\sum_{g=1}^{q} (\mu^{(g)} - \mu)(\mu^{(g)} - \mu)' = (q - 1)\Sigma + M\sum_{g=1}^{q} \nu_g\nu_g'.$

The MANOVA model with random effects is

(11)  $X_\alpha^{(g)} = \mu + V_g + U_\alpha^{(g)}, \qquad \alpha = 1,\ldots,M, \quad g = 1,\ldots,q,$

where $V_g$ has the distribution $N(0, \Theta)$. Then $X_\alpha^{(g)}$ has the distribution
$N(\mu, \Sigma + \Theta)$. The null hypothesis of no effect is

(12)  $\Theta = 0.$

In this model $G$ again has the distribution $W(\Sigma, q(M - 1))$. Since $\bar{X}^{(g)} = \mu +
V_g + \bar{U}^{(g)}$ has the distribution $N(\mu, (1/M)\Sigma + \Theta)$, $H$ has the distribution
$W(\Sigma + M\Theta, q - 1)$. The null hypothesis (12) is equivalent to the equality of
the covariance matrices in these two Wishart distributions; that is, $\Sigma = \Sigma +
M\Theta$. The matrices $G$ and $H$ correspond to $A_1$ and $A_2$ in Section 10.6.1.
However, here the alternative to the null hypothesis is that $(\Sigma + M\Theta) - \Sigma$ is
positive semidefinite, rather than $\Sigma_1 \ne \Sigma_2$. The null hypothesis is to be
rejected if $H$ is too large relative to $G$. Any of the criteria presented in
Section 10.2 can be used to test the null hypothesis here, and its distribution
under the null hypothesis is the same as given there.

The likelihood ratio criterion for testing $\Theta = 0$ must take into account the
fact that $\Theta$ is positive semidefinite; that is, the maximum likelihood estimators
of $\Sigma$ and $\Sigma + M\Theta$ under $\Omega$ must be such that the estimator of $\Theta$ is
positive semidefinite. Let $l_1 > l_2 > \cdots > l_p$ be the roots of

(13)  $\left| \dfrac{1}{q}H - l\,\dfrac{1}{q(M - 1)}G \right| = 0.$

(Note $\{1/[q(M - 1)]\}G$ and $(1/q)H$ maximize the likelihood without regard
to $\Theta$ being positive definite.) Let $l_i^* = l_i$ if $l_i > 1$, and let $l_i^* = 1$ if $l_i \le 1$.
Then the likelihood ratio criterion for testing the hypothesis $\Theta = 0$ against
the alternative $\Theta$ positive semidefinite and $\Theta \ne 0$ is

(14)  $M^{\frac{1}{2}qMp} \prod_{i=1}^{p} \dfrac{l_i^{*\frac{1}{2}q}}{(l_i^* + M - 1)^{\frac{1}{2}qM}} = M^{\frac{1}{2}qMk} \prod_{i=1}^{k} \dfrac{l_i^{\frac{1}{2}q}}{(l_i + M - 1)^{\frac{1}{2}qM}},$

where $k$ is the number of roots of (13) greater than 1. [See Anderson (1946b),
(1984a), (1989a), Morris and Olkin (1964), and Klotz and Putter (1969).]


10.7. TESTING THE HYPOTHESIS THAT A COVARIANCE MATRIX IS 
PROPORTIONAL TO A GIVEN MATRIX; THE SPHERICITY TEST 


10.7.1. The Hypothesis 


In many statistical analyses that are considered univariate, the assumption is 
made that a set of random variables are independent and have a common 
variance. In this section we consider a test of these assumptions based on 
repeated sets of observations. 

More precisely, we use a sample of $p$-component vectors $x_1,\ldots,x_N$ from
$N(\mu, \Sigma)$ to test the hypothesis $H: \Sigma = \sigma^2I$, where $\sigma^2$ is not specified. The
hypothesis can be given an algebraic interpretation in terms of the characteristic
roots of $\Sigma$, that is, the roots of

(1)  $|\Sigma - \phi I| = 0.$

The hypothesis is true if and only if all the roots of (1) are equal.† Another
way of putting it is that the arithmetic mean of roots $\phi_1,\ldots,\phi_p$ is equal to
the geometric mean, that is,

(2)  $\dfrac{\prod_{i=1}^{p} \phi_i^{1/p}}{\sum_{i=1}^{p} \phi_i/p} = \dfrac{|\Sigma|^{1/p}}{\operatorname{tr}\Sigma/p} = 1.$

The lengths squared of the principal axes of the ellipsoids of constant density
are proportional to the roots $\phi_i$ (see Chapter 11); the hypothesis specifies
that these are equal, that is, that the ellipsoids are spheres.

The hypothesis $H$ is equivalent to the more general form $\Psi = \sigma^2\Psi_0$, with
$\Psi_0$ specified, having observation vectors $y_1,\ldots,y_N$ from $N(\nu, \Psi)$. Let $C$ be
a matrix such that

(3)  $C\Psi_0C' = I,$

and let $\mu^* = C\nu$, $\Sigma^* = C\Psi C'$, $x_\alpha^* = Cy_\alpha$. Then $x_1^*,\ldots,x_N^*$ are observations
from $N(\mu^*, \Sigma^*)$, and the hypothesis is transformed into $H: \Sigma^* = \sigma^2I$.


10.7.2. The Criterion 


In the canonical form the hypothesis $H$ is a combination of the hypothesis
$H_1$: $\Sigma$ is diagonal or the components of $X$ are independent and $H_2$: the
diagonal elements of $\Sigma$ are equal given that $\Sigma$ is diagonal or the variances of
the components of $X$ are equal given that the components are independent.
Thus by Lemma 10.3.1 the likelihood ratio criterion $\lambda$ for $H$ is the product of
the criterion $\lambda_1$ for $H_1$ and $\lambda_2$ for $H_2$. From Section 9.2 we see that the
criterion for $H_1$ is

(4)  $\lambda_1 = \dfrac{|A|^{\frac{1}{2}N}}{\prod_{i=1}^{p} a_{ii}^{\frac{1}{2}N}} = |r_{ij}|^{\frac{1}{2}N},$

where

(5)  $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (a_{ij})$

and $r_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}}$. We use the results of Section 10.2 to obtain $\lambda_2$ by
considering the $i$th component of $x_\alpha$ as the $\alpha$th observation from the $i$th
population. ($p$ here is $q$ in Section 10.2; $N$ here is $N_g$ there; $pN$ here is $N$
there.) Thus

(6)  $\lambda_2 = \dfrac{\prod_{i=1}^{p} a_{ii}^{\frac{1}{2}N}}{\left( \sum_{i=1}^{p} a_{ii}/p \right)^{\frac{1}{2}pN}} = \dfrac{\prod_{i=1}^{p} a_{ii}^{\frac{1}{2}N}}{(\operatorname{tr}A/p)^{\frac{1}{2}pN}}.$

Thus the criterion for $H$ is

(7)  $\lambda = \lambda_1\lambda_2 = \dfrac{|A|^{\frac{1}{2}N}}{(\operatorname{tr}A/p)^{\frac{1}{2}pN}}.$


† This follows from the fact that $\Sigma = O'\Phi O$, where $\Phi$ is a diagonal matrix with roots as diagonal
elements and $O$ is an orthogonal matrix.
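It will be observed that $\lambda$ resembles (2). If $l_1,\ldots,l_p$ are the roots of

(8)  $|S - lI| = 0,$

where $S = (1/n)A$, the criterion is a power of the ratio of the geometric
mean to the arithmetic mean,

(9)  $\lambda = \left[ \dfrac{\prod_{i=1}^{p} l_i^{1/p}}{\sum_{i=1}^{p} l_i/p} \right]^{\frac{1}{2}pN}.$


Now let us go back to the hypothesis $\Psi = \sigma^2\Psi_0$, given observation
vectors $y_1,\ldots,y_N$ from $N(\nu, \Psi)$. In the transformed variables $\{x_\alpha^*\}$ the
criterion is $|A^*|^{\frac{1}{2}N}(\operatorname{tr}A^*/p)^{-\frac{1}{2}pN}$, where

(10)  $A^* = \sum_{\alpha=1}^{N} (x_\alpha^* - \bar{x}^*)(x_\alpha^* - \bar{x}^*)' = C\sum_{\alpha=1}^{N} (y_\alpha - \bar{y})(y_\alpha - \bar{y})'C' = CBC',$

where

(11)  $B = \sum_{\alpha=1}^{N} (y_\alpha - \bar{y})(y_\alpha - \bar{y})'.$

From (3) we have $\Psi_0 = C^{-1}(C')^{-1} = (C'C)^{-1}$. Thus

(12)  $|A^*| = |CBC'| = |B| \cdot |C'C| = \dfrac{|B|}{|\Psi_0|} = |B\Psi_0^{-1}|,$
      $\operatorname{tr}A^* = \operatorname{tr}CBC' = \operatorname{tr}BC'C = \operatorname{tr}B\Psi_0^{-1}.$

The results can be summarized.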
Theorem 10.7.1. Given a set of $p$-component observation vectors $y_1,\ldots,y_N$
from $N(\nu, \Psi)$, the likelihood ratio criterion for testing the hypothesis $H: \Psi =
\sigma^2\Psi_0$, where $\Psi_0$ is specified and $\sigma^2$ is not specified, is

(13)  $\dfrac{|B\Psi_0^{-1}|^{\frac{1}{2}N}}{(\operatorname{tr}B\Psi_0^{-1}/p)^{\frac{1}{2}pN}}.$

Mauchly (1940) gave this criterion and its moments under the null
hypothesis.

The maximum likelihood estimator of $\sigma^2$ under the null hypothesis is
$\operatorname{tr}B\Psi_0^{-1}/(pN)$, which is $\operatorname{tr}A/(pN)$ in canonical form; an unbiased estimator
is $\operatorname{tr}B\Psi_0^{-1}/[p(N - 1)]$ or $\operatorname{tr}A/[p(N - 1)]$ in canonical form [Hotelling
(1951)]. Then $\operatorname{tr}B\Psi_0^{-1}/\sigma^2$ has the $\chi^2$-distribution with $p(N - 1)$ degrees of
freedom.


10.7.3. The Distribution and Moments of the Criterion 


The distribution of the likelihood ratio criterion under the null hypothesis
can be characterized by the facts that $\lambda = \lambda_1\lambda_2$ and $\lambda_1$ and $\lambda_2$ are independent
and by the characterizations of $\lambda_1$ and $\lambda_2$. As was observed in Section
7.6, when $\Sigma$ is diagonal the correlation coefficients $\{r_{ij}\}$ are distributed
independently of the variances $\{a_{ii}/(N - 1)\}$. Since $\lambda_1$ depends only on $\{r_{ij}\}$
and $\lambda_2$ depends only on $\{a_{ii}\}$, they are independently distributed when the
null hypothesis is true. Let $W = \lambda^{2/N}$, $W_1 = \lambda_1^{2/N}$, $W_2 = \lambda_2^{2/N}$. From Theorem
9.3.3, we see that $W_1$ is distributed as $\prod_{i=2}^{p} X_i$, where $X_2,\ldots,X_p$ are
independent and $X_i$ has the density $\beta[x \mid \frac{1}{2}(n - i + 1), \frac{1}{2}(i - 1)]$, where $n =
N - 1$. From Theorem 10.4.2 with $W_2 = p^pV_1^{2/n}$, we find that $W_2$ is distributed
as $p^p\prod_{j=2}^{p} Y_j^{j-1}(1 - Y_j)$, where $Y_2,\ldots,Y_p$ are independent and $Y_j$
has the density $\beta(y \mid \frac{1}{2}n(j - 1), \frac{1}{2}n)$. Then $W$ is distributed as $W_1W_2$, where $W_1$
and $W_2$ are independent.

The moments of $W$ can be found from this characterization or from
Theorems 9.3.4 and 10.4.4. We have

(14)  $\mathcal{E}W_1^h = \dfrac{\Gamma^p(\frac{1}{2}n)\,\Gamma_p(\frac{1}{2}n + h)}{\Gamma^p(\frac{1}{2}n + h)\,\Gamma_p(\frac{1}{2}n)},$

(15)  $\mathcal{E}W_2^h = p^{ph}\,\dfrac{\Gamma^p(\frac{1}{2}n + h)\,\Gamma(\frac{1}{2}pn)}{\Gamma^p(\frac{1}{2}n)\,\Gamma(\frac{1}{2}pn + ph)}.$

It follows that

(16)  $\mathcal{E}W^h = p^{ph}\,\dfrac{\Gamma(\frac{1}{2}pn)}{\Gamma(\frac{1}{2}pn + ph)} \cdot \dfrac{\Gamma_p(\frac{1}{2}n + h)}{\Gamma_p(\frac{1}{2}n)}.$

For $p = 2$ we have

(17)  $\mathcal{E}W^h = 4^h\,\dfrac{\Gamma(n)}{\Gamma(n + 2h)} \prod_{i=1}^{2} \dfrac{\Gamma[\frac{1}{2}(n + 1 - i) + h]}{\Gamma[\frac{1}{2}(n + 1 - i)]}$

      $= \dfrac{\Gamma(n)\,\Gamma(n - 1 + 2h)}{\Gamma(n + 2h)\,\Gamma(n - 1)} = \dfrac{n - 1}{n - 1 + 2h}$

      $= (n - 1)\int_0^1 z^{n-2+2h}\,dz,$

by use of the duplication formula for the gamma function. Thus $W$ is
distributed as $Z^2$, where $Z$ has the density $(n - 1)z^{n-2}$, and $W$ has the
density $\frac{1}{2}(n - 1)w^{\frac{1}{2}(n-3)}$. The cdf is

(18)  $\Pr\{W \le w\} = F(w) = w^{\frac{1}{2}(n-1)}.$

This result can also be found from the joint distribution of $l_1, l_2$, the roots of
(8). The density for $p = 3$, 4, and 6 has been obtained by Consul (1967b). See
also Pillai and Nagarsenkar (1971).
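
The reduction of (16) to the closed form $(n - 1)/(n - 1 + 2h)$ for $p = 2$ can be checked numerically. The following sketch uses only `math.lgamma` from the standard library; the function names are illustrative. It evaluates the logarithm of (16), writing the ratio $\Gamma_p(\frac{1}{2}n + h)/\Gamma_p(\frac{1}{2}n)$ as a product of ordinary gamma ratios.

```python
import math

def log_moment_W(h, p, n):
    """log of E W^h from (16), for the sphericity criterion W = lambda^(2/N), n = N - 1."""
    val = p * h * math.log(p)
    val += math.lgamma(0.5 * p * n) - math.lgamma(0.5 * p * n + p * h)
    # Gamma_p(n/2 + h) / Gamma_p(n/2) = prod_i Gamma[(n+1-i)/2 + h] / Gamma[(n+1-i)/2]
    for i in range(1, p + 1):
        a = 0.5 * (n + 1 - i)
        val += math.lgamma(a + h) - math.lgamma(a)
    return val

def moment_W_p2(h, n):
    """Closed form of (17) for p = 2."""
    return (n - 1.0) / (n - 1.0 + 2.0 * h)
```

For example, with $p = 2$, $n = 12$, and $h = 1$, both expressions give $11/13$.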


10.7.4. Asymptotic Expansion of the Distribution 


From (16) we see that the $r$th moment of $W^{\frac{1}{2}n} = Z$, say, is

(19)  $\mathcal{E}Z^r = Kp^{\frac{1}{2}pnr}\,\dfrac{\prod_{i=1}^{p} \Gamma[\frac{1}{2}n(1 + r) + \frac{1}{2}(1 - i)]}{\Gamma[\frac{1}{2}pn(1 + r)]}.$

This is of the form of (1), Section 8.5, with

(20)  $a = p, \quad x_k = \frac{1}{2}n, \quad \xi_k = \frac{1}{2}(1 - k), \quad k = 1,\ldots,p,$
      $b = 1, \quad y_1 = \frac{1}{2}np, \quad \eta_1 = 0.$

Thus the expansion of Section 8.5 is valid with $f = \frac{1}{2}p(p + 1) - 1$. To make
the second term in the expansion zero we take $\rho$ so

(21)  $1 - \rho = \dfrac{2p^2 + p + 2}{6pn}.$

Then

(22)  $\omega_2 = \dfrac{(p + 2)(p - 1)(p - 2)(2p^3 + 6p^2 + 3p + 2)}{288p^2n^2\rho^2}.$

Thus the cdf of $W$ is found from

(23)  $\Pr\{-2\rho \log Z \le z\} = \Pr\{-n\rho \log W \le z\}$
      $= \Pr\{\chi_f^2 \le z\} + \omega_2\left( \Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\} \right) + O(n^{-3}).$

Factors $c(n, p, \varepsilon)$ have been tabulated in Table B.6 such that

(24)  $\Pr\{-n\rho \log W \ge c(n, p, \varepsilon)\,\chi_{\frac{1}{2}p(p+1)-1}^2(\varepsilon)\} = \varepsilon.$

Nagarsenkar and Pillai (1973a) have tables for $W$.


10.7.5. Invariant Tests 


The null hypothesis $H: \Sigma = \sigma^2I$ is invariant with respect to transformations
$X^* = cQX + \nu$, where $c$ is a scalar and $Q$ is an orthogonal matrix. The
invariant of the sufficient statistic under shift of location is $A$, the invariants
of $A$ under orthogonal transformations are the characteristic roots $l_1,\ldots,l_p$,
and the invariants of the roots under scale transformations are functions
that are homogeneous of degree 0, such as the ratios of roots, say
$l_1/l_2,\ldots,l_{p-1}/l_p$. Invariant tests are based on such functions; the likelihood
ratio criterion is such a function.

Nagao (1973a) proposed the criterion

(25)  $\frac{1}{2}n\operatorname{tr}\left( S - \dfrac{\operatorname{tr}S}{p}I \right)\dfrac{p}{\operatorname{tr}S}\left( S - \dfrac{\operatorname{tr}S}{p}I \right)\dfrac{p}{\operatorname{tr}S} = \frac{1}{2}n\operatorname{tr}\left( \dfrac{p}{\operatorname{tr}S}S - I \right)^2$

      $= \frac{1}{2}n\left[ \dfrac{p^2}{(\operatorname{tr}S)^2}\operatorname{tr}S^2 - p \right] = \frac{1}{2}n\left[ \dfrac{p^2\sum_{i=1}^{p} l_i^2}{\left( \sum_{i=1}^{p} l_i \right)^2} - p \right] = \frac{1}{2}n\,\dfrac{\sum_{i=1}^{p} (l_i - \bar{l})^2}{\bar{l}^2},$

where $\bar{l} = \sum_{i=1}^{p} l_i/p$. The left-hand side of (25) is based on the loss function
$L_q(\Sigma, G)$ of Section 7.8; the right-hand side shows it is proportional to the
square of the coefficient of variation of the characteristic roots of the sample
covariance matrix $S$. Another criterion is $l_1/l_p$. Percentage points have been
given by Krishnaiah and Schuurmann (1974).


10.7.6. Confidence Regions 


Given observations $y_1,\ldots,y_N$ from $N(\nu, \Psi)$, we can test $\Psi = \sigma^2\Psi_0$ for any
specified $\Psi_0$. From this family of tests we can set up a confidence region for
$\Psi$. If any matrix is in the confidence region, all multiples of it are. This kind
of confidence region is of interest if all components of $y_\alpha$ are measured in
the same unit, but the investigator wants a region independent of this
common unit. The confidence region of confidence $1 - \varepsilon$ consists of all
matrices $\Psi^*$ satisfying

(26)  $\dfrac{|B\Psi^{*-1}|}{[(\operatorname{tr}B\Psi^{*-1})/p]^p} \ge \lambda^{2/N}(\varepsilon),$

where $\lambda(\varepsilon)$ is the $\varepsilon$ significance level for the criterion.

Consider the case of $p = 2$. If the common unit of measurement is
irrelevant, the investigator is interested in $\tau = \psi_{11}/\psi_{22}$ and $\rho = \psi_{12}/\sqrt{\psi_{11}\psi_{22}}$.
In this case

(27)  $\Psi^{-1} = \dfrac{1}{\psi_{11}\psi_{22}(1 - \rho^2)} \begin{pmatrix} \psi_{22} & -\rho\sqrt{\psi_{11}\psi_{22}} \\ -\rho\sqrt{\psi_{11}\psi_{22}} & \psi_{11} \end{pmatrix} = \dfrac{1}{\psi_{11}(1 - \rho^2)} \begin{pmatrix} 1 & -\rho\sqrt{\tau} \\ -\rho\sqrt{\tau} & \tau \end{pmatrix}.$

The region in terms of $\tau$ and $\rho$ is

(28)  $\dfrac{4(b_{11}b_{22} - b_{12}^2)\,\tau(1 - \rho^2)}{(b_{11} + \tau b_{22} - 2\rho\sqrt{\tau}\,b_{12})^2} \ge \lambda^{2/N}(\varepsilon).$


Hickman (1953) has given an example of such a confidence region.


10.8. TESTING THE HYPOTHESIS THAT A COVARIANCE 
MATRIX IS EQUAL TO A GIVEN MATRIX 


10.8.1. The Criteria 


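If $Y$ is distributed according to $N(\nu, \Psi)$, we wish to test $H_1$ that $\Psi = \Psi_0$,
where $\Psi_0$ is a given positive definite matrix. By the argument of the
preceding section we see that this is equivalent to testing the hypothesis
$H_1: \Sigma = I$, where $\Sigma$ is the covariance matrix of a vector $X$ distributed
according to $N(\mu, \Sigma)$. Given a sample $x_1,\ldots,x_N$, the likelihood ratio criterion
is

(1)  $\lambda_1 = \dfrac{\max_{\mu} L(\mu, I)}{\max_{\mu, \Sigma} L(\mu, \Sigma)},$

where the likelihood function is

(2)  $L(\mu, \Sigma) = (2\pi)^{-\frac{1}{2}pN}|\Sigma|^{-\frac{1}{2}N}\exp\left[ -\frac{1}{2}\sum_{\alpha=1}^{N} (x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu) \right].$

Results in Chapter 3 show that

(3)  $\lambda_1 = \dfrac{(2\pi)^{-\frac{1}{2}pN}\exp\left[ -\frac{1}{2}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})'(x_\alpha - \bar{x}) \right]}{(2\pi)^{-\frac{1}{2}pN}\,|(1/N)A|^{-\frac{1}{2}N}\,e^{-\frac{1}{2}pN}} = \left( \dfrac{e}{N} \right)^{\frac{1}{2}pN}|A|^{\frac{1}{2}N}e^{-\frac{1}{2}\operatorname{tr}A},$

where

(4)  $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'.$


Sugiura and Nagao (1968) have shown that the likelihood ratio test is biased,
but the modified likelihood ratio test based on

(5)  $\lambda_1^* = \left( \dfrac{e}{n} \right)^{\frac{1}{2}pn}|A|^{\frac{1}{2}n}e^{-\frac{1}{2}\operatorname{tr}A} = e^{\frac{1}{2}pn}\left( |S|e^{-\operatorname{tr}S} \right)^{\frac{1}{2}n},$

where $S = (1/n)A$, is unbiased. Note that

(6)  $-\dfrac{2}{n}\log \lambda_1^* = \operatorname{tr}S - \log|S| - p = L_l(I, S),$

where $L_l(I, S)$ is the loss function for estimating $I$ by $S$ defined in (2) of
Section 7.8. In terms of the characteristic roots of $S$ the criterion (6) is a
constant plus

(7)  $\sum_{i=1}^{p} l_i - \log\prod_{i=1}^{p} l_i - p = \sum_{i=1}^{p} (l_i - \log l_i - 1);$

for each $i$ the minimum of (7) is at $l_i = 1$.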

Using the algebra of the preceding section, we see that given y,,..., yy as 
observation vectors of p components from N(v, W), the modified likelihood 
ratio criterion for testing the hypothesis H 11V = A, where Wy is specified, 
is 


ipn Lo » 
(8) xt = (=) [BA || eTit EY 
where 
N 
(9) B= Y(x-).—y»). 
a] 
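Relations (5) to (7) say that $-2\log\lambda_1^*$ equals $n\sum_i (l_i - \log l_i - 1)$, where the $l_i$ are the characteristic roots of $S$ (or of $(1/n)B\Psi_0^{-1}$ in the original coordinates). A minimal sketch, assuming the roots are supplied by the caller; the function names are hypothetical.

```python
import math


def neg2_log_modified_lr(roots, n):
    """-2 log(lambda_1^*) from the characteristic roots of S, using (6)-(7)."""
    return n * sum(l - math.log(l) - 1.0 for l in roots)


def neg2_log_modified_lr_direct(roots, n):
    """The same quantity from (5): lambda_1^* = e^{pn/2} (|S| e^{-tr S})^{n/2}."""
    p = len(roots)
    log_det = sum(math.log(l) for l in roots)  # log |S|
    trace = sum(roots)                         # tr S
    return -p * n - n * log_det + n * trace


if __name__ == "__main__":
    print(neg2_log_modified_lr([2.0, 0.5], n=10))
    print(neg2_log_modified_lr_direct([2.0, 0.5], n=10))
```

Each summand $l_i - \log l_i - 1$ is nonnegative and vanishes only at $l_i = 1$, so the statistic is zero exactly when $S = I$.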


10.8.2. The Distribution and Moments of the Modified Likelihood 
Ratio Criterion 


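The null hypothesis $H_1: \Sigma = I$ is the intersection of the null hypothesis of
Section 10.7, $H: \Sigma = \sigma^2 I$, and the null hypothesis $\sigma^2 = 1$ given $\Sigma = \sigma^2 I$.
The likelihood ratio criterion for $H_1$ given by (3) is the product of (7) of
Section 10.7 and

(10) $$\left(\frac{\operatorname{tr}A}{pN}\right)^{\frac12 pN}e^{-\frac12\operatorname{tr}A + \frac12 pN},$$

which is the likelihood ratio criterion for testing the hypothesis $\sigma^2 = 1$ given
$\Sigma = \sigma^2 I$. The modified criterion $\lambda_1^*$ is the product of $|A|^{\frac12 n}/(\operatorname{tr}A/p)^{\frac12 pn}$ and

(11) $$\left(\frac{\operatorname{tr}A}{pn}\right)^{\frac12 pn}e^{-\frac12\operatorname{tr}A + \frac12 pn};$$

these two factors are independent (Lemma 10.4.1). The characterization of
the distribution of the modified criterion can be obtained from Section
10.7.3. The quantity $\operatorname{tr}A$ has the $\chi^2$-distribution with $np$ degrees of freedom
under the null hypothesis.

Instead of obtaining the moments and characteristic function of $\lambda_1^*$ [defined
by (5)] from the preceding characterization, we shall find them by use
of the fact that $A$ has the distribution $W(\Sigma, n)$. We shall calculate

(12) $$\mathcal{E}\lambda_1^{*h} = \int\left(\frac{e}{n}\right)^{\frac12 pnh}|A|^{\frac12 nh}e^{-\frac12 h\operatorname{tr}A}\,w(A\,|\,\Sigma, n)\,dA = \frac{e^{\frac12 pnh}}{n^{\frac12 pnh}}\int |A|^{\frac12 nh}e^{-\frac12 h\operatorname{tr}A}\,w(A\,|\,\Sigma, n)\,dA.$$

Since

(13) $$\begin{aligned}
|A|^{\frac12 nh}e^{-\frac12 h\operatorname{tr}A}\,w(A\,|\,\Sigma, n)
&= \frac{|A|^{\frac12(n + nh - p - 1)}e^{-\frac12\operatorname{tr}(\Sigma^{-1} + hI)A}}{2^{\frac12 pn}|\Sigma|^{\frac12 n}\Gamma_p(\frac12 n)}\\
&= \frac{2^{\frac12 pnh}\,\Gamma_p[\frac12 n(1+h)]}{|\Sigma^{-1} + hI|^{\frac12 n(1+h)}|\Sigma|^{\frac12 n}\Gamma_p(\frac12 n)}\cdot\frac{|\Sigma^{-1} + hI|^{\frac12 n(1+h)}|A|^{\frac12(n + nh - p - 1)}e^{-\frac12\operatorname{tr}(\Sigma^{-1} + hI)A}}{2^{\frac12 pn(1+h)}\,\Gamma_p[\frac12 n(1+h)]}\\
&= \frac{2^{\frac12 pnh}|\Sigma|^{\frac12 nh}\,\Gamma_p[\frac12 n(1+h)]}{|I + h\Sigma|^{\frac12 n(1+h)}\,\Gamma_p(\frac12 n)}\;w\!\left(A\,\big|\,(\Sigma^{-1} + hI)^{-1},\, n(1+h)\right),
\end{aligned}$$

the $h$th moment of $\lambda_1^*$ is

(14) $$\mathcal{E}\lambda_1^{*h} = \left(\frac{2e}{n}\right)^{\frac12 pnh}\frac{|\Sigma|^{\frac12 nh}}{|I + h\Sigma|^{\frac12 n(1+h)}}\prod_{j=1}^p\frac{\Gamma[\frac12(n + nh + 1 - j)]}{\Gamma[\frac12(n + 1 - j)]}.$$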


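Then the characteristic function of $-2\log\lambda_1^*$ is

(15) $$\mathcal{E}e^{-2it\log\lambda_1^*} = \mathcal{E}\lambda_1^{*-2it} = \left(\frac{2e}{n}\right)^{-ipnt}\frac{|\Sigma|^{-int}}{|I - 2it\Sigma|^{\frac12 n(1 - 2it)}}\prod_{j=1}^p\frac{\Gamma[\frac12(n + 1 - j) - int]}{\Gamma[\frac12(n + 1 - j)]}.$$

When the null hypothesis is true, $\Sigma = I$, and

(16) $$\mathcal{E}e^{-2it\log\lambda_1^*} = \left(\frac{2e}{n}\right)^{-ipnt}(1 - 2it)^{-\frac12 pn(1 - 2it)}\prod_{j=1}^p\frac{\Gamma[\frac12(n + 1 - j) - int]}{\Gamma[\frac12(n + 1 - j)]}.$$

This characteristic function is the product of $p$ terms such as

(17) $$\phi_j(t) = \left(\frac{2e}{n}\right)^{-int}(1 - 2it)^{-\frac12 n(1 - 2it)}\frac{\Gamma[\frac12(n + 1 - j) - int]}{\Gamma[\frac12(n + 1 - j)]}.$$

Thus $-2\log\lambda_1^*$ is distributed as the sum of $p$ independent variates, the
characteristic function of the $j$th being (17). Using Stirling's approximation
for the gamma function, we have

(18) $$\begin{aligned}
\phi_j(t) &\sim \left(\frac{2e}{n}\right)^{-int}(1 - 2it)^{-\frac12 n(1 - 2it)}\,\frac{e^{-[\frac12(n - j + 1) - int]}\,[\frac12(n - j + 1) - int]^{\frac12(n - j) - int}}{e^{-\frac12(n - j + 1)}\,[\frac12(n - j + 1)]^{\frac12(n - j)}}\\
&= (1 - 2it)^{-\frac12 j}\left[1 - \frac{it(j - 1)}{\frac12(n - j + 1)(1 - 2it)}\right]^{\frac12(n - j)}\left[1 - \frac{j - 1}{n(1 - 2it)}\right]^{-int}.
\end{aligned}$$

As $n \to \infty$, $\phi_j(t) \to (1 - 2it)^{-\frac12 j}$, which is the characteristic function of $\chi_j^2$
($\chi^2$ with $j$ degrees of freedom). Thus $-2\log\lambda_1^*$ is asymptotically distributed
as $\sum_{j=1}^p \chi_j^2$, which is $\chi^2$ with $\sum_{j=1}^p j = \frac12 p(p + 1)$ degrees of freedom. The
distribution of $\lambda_1^*$ can be further expanded [Korin (1968), Davis (1971)] as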


(19) $$\Pr\{-2\rho\log\lambda_1^* \le z\} = \Pr\{\chi_f^2 \le z\} + \frac{\gamma_2}{\rho^2(N - 1)^2}\left(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right) + O(N^{-3}),$$

where

(20) $$\rho = 1 - \frac{2p^2 + 3p - 1}{6N(p + 1)},$$

(21) $$\gamma_2 = \frac{p(2p^4 + 6p^3 + p^2 - 12p - 13)}{288(p + 1)}.$$

Nagarsenker and Pillai (1973b) found exact distributions and tabulated 5%
and 1% significance points, as did Davis and Field (1971), for $p = 2(1)10$ and
$n = 6(1)30(5)50, 60, 120$. Table B.7 [due to Korin (1968)] gives some 5% and
1% significance points of $-2\log\lambda_1^*$ for small values of $n$ and $p = 2(1)10$.


10.8.3. Invariant Tests 


The null hypothesis $H_1: \Sigma = I$ is invariant with respect to transformations
$X^* = QX + \nu$, where $Q$ is an orthogonal matrix. The invariants of the sufficient
statistics are the characteristic roots $l_1, \ldots, l_p$ of $S$, and the invariants of
the parameters are the characteristic roots of $\Sigma$. Invariant tests are based on
the roots of $S$; the modified likelihood ratio criterion is one of them. Nagao
(1973a) suggested the criterion

(22) $$\tfrac12 n\operatorname{tr}(S - I)^2 = \tfrac12 n\sum_{i=1}^p (l_i - 1)^2.$$

Under the null hypothesis this criterion has a limiting $\chi^2$-distribution with
$\frac12 p(p + 1)$ degrees of freedom.

Roy (1957), Section 6.4, proposed a test based on the largest and smallest
characteristic roots $l_1$ and $l_p$: Reject the null hypothesis if

(23) $$l_p < l \quad\text{or}\quad l_1 > u,$$

where

(24) $$\Pr\{l \le l_p,\ l_1 \le u \mid \Sigma = I\} = 1 - \varepsilon$$

and $\varepsilon$ is the significance level. Clemm, Krishnaiah, and Waikar (1973) give
tables of $u = 1/l$. See also Schuurman and Waikar (1973).
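The two invariant procedures above depend on the data only through the characteristic roots of $S$. A minimal sketch, assuming the roots and the critical values $l$ and $u$ are supplied by the caller (in practice $l$ and $u$ would come from the cited tables; the values used here are placeholders).

```python
def nagao_identity_statistic(roots, n):
    """Nagao's criterion (22): (1/2) n * sum_i (l_i - 1)^2."""
    return 0.5 * n * sum((l - 1.0) ** 2 for l in roots)


def roy_rejects(roots, lower, upper):
    """Roy's rule (23): reject if the smallest root is below `lower`
    or the largest root is above `upper`."""
    return min(roots) < lower or max(roots) > upper


if __name__ == "__main__":
    roots = [2.0, 0.5]
    print(nagao_identity_statistic(roots, n=10))
    # lower and upper are placeholder critical values, not tabulated ones
    print(roy_rejects(roots, lower=0.4, upper=2.5))
```

Nagao's statistic would be referred to a $\chi^2$ distribution with $\frac12 p(p+1)$ degrees of freedom, as stated above.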


10.8.4. Confidence Bounds for Quadratic Forms 


The test procedure based on the smallest and largest characteristic roots can
be inverted to give confidence bounds on quadratic forms in $\Sigma$. Suppose $nS$
has the distribution $W(\Sigma, n)$. Let $C$ be a nonsingular matrix such that
$\Sigma = C'C$. Then $nS^* = n(C')^{-1}SC^{-1}$ has the distribution $W(I, n)$. Since
$l_p^* \le a'S^*a/a'a \le l_1^*$ for all $a$, where $l_p^*$ and $l_1^*$ are the smallest and largest
characteristic roots of $S^*$ (Sections 11.2 and A.2),

(25) $$\Pr\left\{l \le \frac{a'S^*a}{a'a} \le u\quad \forall a \ne 0\right\} = 1 - \varepsilon,$$

where

(26) $$\Pr\{l \le l_p^* \le l_1^* \le u\} = 1 - \varepsilon.$$

Let $a = Cb$. Then $a'a = b'C'Cb = b'\Sigma b$ and $a'S^*a = b'C'S^*Cb = b'Sb$. Thus

(27) $$1 - \varepsilon = \Pr\left\{l \le \frac{b'Sb}{b'\Sigma b} \le u\quad \forall b \ne 0\right\} = \Pr\left\{\frac{b'Sb}{u} \le b'\Sigma b \le \frac{b'Sb}{l}\quad \forall b\right\}.$$

Given an observed $S$, one can assert

(28) $$\frac{b'Sb}{u} \le b'\Sigma b \le \frac{b'Sb}{l}\quad \forall b$$

with confidence $1 - \varepsilon$.
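The bound (28) is straightforward to evaluate for any chosen vector $b$. A minimal sketch in plain Python, assuming $S$ is given as a list of rows and that $l$ and $u$ have been obtained elsewhere; the values in the usage lines are placeholders.

```python
def quadratic_form(b, s):
    """Return b' S b for a vector b and a square matrix S (list of rows)."""
    p = len(b)
    return sum(b[i] * s[i][j] * b[j] for i in range(p) for j in range(p))


def quadratic_form_bounds(b, s, l, u):
    """Simultaneous bounds (28): b'Sb/u <= b' Sigma b <= b'Sb/l."""
    q = quadratic_form(b, s)
    return q / u, q / l


if __name__ == "__main__":
    S = [[2.0, 1.0], [1.0, 3.0]]
    # b = e_1 bounds sigma_11; b = e_1 - e_2 bounds sigma_11 + sigma_22 - 2 sigma_12
    print(quadratic_form_bounds([1.0, 0.0], S, l=0.5, u=2.0))
    print(quadratic_form_bounds([1.0, -1.0], S, l=0.5, u=2.0))
```

Because (28) holds for all $b$ at once, intervals obtained from different choices of $b$ hold simultaneously with the stated confidence.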


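If $b$ has 1 in the $i$th position and 0's elsewhere, (28) is $s_{ii}/u \le \sigma_{ii} \le s_{ii}/l$. If
$b$ has 1 in the $i$th position, $-1$ in the $j$th position, $i \ne j$, and 0's elsewhere,
then (28) is

(29) $$\frac{s_{ii} + s_{jj} - 2s_{ij}}{u} \le \sigma_{ii} + \sigma_{jj} - 2\sigma_{ij} \le \frac{s_{ii} + s_{jj} - 2s_{ij}}{l}.$$

Manipulation of these inequalities yields

(30) $$\frac{s_{ij}}{l} - \frac{s_{ii} + s_{jj}}{2}\left(\frac{1}{l} - \frac{1}{u}\right) \le \sigma_{ij} \le \frac{s_{ij}}{u} + \frac{s_{ii} + s_{jj}}{2}\left(\frac{1}{l} - \frac{1}{u}\right),\qquad i \ne j.$$

We can obtain simultaneously confidence intervals on all elements of $\Sigma$.

From (27) we can obtain

(31) $$1 - \varepsilon = \Pr\left\{\frac{1}{u}\frac{b'Sb}{b'b} \le \frac{b'\Sigma b}{b'b} \le \frac{1}{l}\frac{b'Sb}{b'b}\quad \forall b \ne 0\right\} \le \Pr\left\{\frac{l_p}{u} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l}\right\},$$

where $l_1$ and $l_p$ are the largest and smallest characteristic roots of $S$ and $\lambda_1$
and $\lambda_p$ are the largest and smallest characteristic roots of $\Sigma$. Then

(32) $$\frac{l_p}{u} \le \lambda(\Sigma) \le \frac{l_1}{l}$$

is a confidence interval for all characteristic roots of $\Sigma$ with confidence at
least $1 - \varepsilon$. In Section 11.6 we give tighter bounds on $\lambda(\Sigma)$ with exact
confidence.


10.9. TESTING THE HYPOTHESIS THAT A MEAN VECTOR AND
A COVARIANCE MATRIX ARE EQUAL TO A GIVEN VECTOR
AND MATRIX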


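In Chapter 3 we pointed out that if $\Psi$ is known, $(\bar y - \nu_0)'\Psi_0^{-1}(\bar y - \nu_0)$ is
suitable for testing

(1) $$H_2: \nu = \nu_0,\quad\text{given } \Psi = \Psi_0.$$

Now let us combine $H_1$ of Section 10.8 and $H_2$, and test

(2) $$H: \nu = \nu_0,\quad \Psi = \Psi_0,$$

on the basis of a sample $y_1, \ldots, y_N$ from $N(\nu, \Psi)$.
Let

(3) $$X = C(Y - \nu_0),$$

where

(4) $$C\Psi_0C' = I.$$

Then $x_1, \ldots, x_N$ constitutes a sample from $N(\mu, \Sigma)$, and the hypothesis is

(5) $$H: \mu = 0,\quad \Sigma = I.$$

The likelihood ratio criterion for $H_2: \mu = 0$, given $\Sigma = I$, is

(6) $$\lambda_2 = e^{-\frac12 N\bar x'\bar x}.$$

The likelihood ratio criterion for $H$ is (by Lemma 10.3.1)

(7) $$\begin{aligned}
\lambda = \lambda_1\lambda_2 &= \left(\frac{e}{N}\right)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12\operatorname{tr}A}\,e^{-\frac12 N\bar x'\bar x}\\
&= \left(\frac{e}{N}\right)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12[\operatorname{tr}A + N\bar x'\bar x]}\\
&= \left(\frac{e}{N}\right)^{\frac12 pN}|A|^{\frac12 N}e^{-\frac12\sum_{\alpha=1}^N x_\alpha'x_\alpha}.
\end{aligned}$$

The likelihood ratio test (rejecting $H$ if $\lambda$ is less than a suitable constant) is
unbiased [Srivastava and Khatri (1979), Theorem 10.4.5]. The two factors $\lambda_1$
and $\lambda_2$ are independent because $\lambda_1$ is a function of $A$ and $\lambda_2$ is a function of
$\bar x$, and $A$ and $\bar x$ are independent. Since

(8) $$\mathcal{E}\lambda_2^h = \mathcal{E}e^{-\frac12 Nh\bar x'\bar x} = (1 + h)^{-\frac12 p},$$

the $h$th moment of $\lambda$ is

(9) $$\mathcal{E}\lambda^h = \mathcal{E}\lambda_1^h\,\mathcal{E}\lambda_2^h = \left(\frac{2e}{N}\right)^{\frac12 pNh}\frac{1}{(1 + h)^{\frac12 p(N + Nh)}}\cdot\frac{\Gamma_p[\frac12(n + Nh)]}{\Gamma_p(\frac12 n)}$$

under the null hypothesis. Then

(10) $$-2\log\lambda = -2\log\lambda_1 - 2\log\lambda_2$$

has asymptotically the $\chi^2$-distribution with $f = p(p + 1)/2 + p$ degrees of
freedom. In fact, an asymptotic expansion of the distribution [Davis (1971)] of
$-2\rho\log\lambda$ is

(11) $$\Pr\{-2\rho\log\lambda \le z\} = \Pr\{\chi_f^2 \le z\} + \frac{\gamma_2}{\rho^2N^2}\left(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right) + O(N^{-3}),$$

where

(12) $$\rho = 1 - \frac{2p^2 + 9p + 11}{6N(p + 3)},$$

(13) $$\gamma_2 = \frac{p(2p^4 + 18p^3 + 49p^2 + 36p - 13)}{288(p + 3)}.$$

Nagarsenker and Pillai (1974) used the moments to derive exact distributions
and tabulated the 5% and 1% significance points for $p = 2(1)6$ and $N =
4(1)20(2)40(5)100$.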
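The expansion (11) with the constants (12) and (13) can be evaluated directly once a $\chi^2$ distribution function is available. The sketch below is illustrative only: it assumes a simple series for the regularized incomplete gamma function in place of a library routine, and the function names are hypothetical.

```python
import math


def chi2_cdf(z, df):
    """Chi-square cdf from the power series of the regularized lower
    incomplete gamma function P(df/2, z/2); adequate for moderate z."""
    if z <= 0.0:
        return 0.0
    a, x = df / 2.0, z / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, 10000):
        term *= x / (a + k)
        total += term
        if term < 1e-17 * total:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def joint_test_constants(p, N):
    """Degrees of freedom f, and rho and gamma_2 of (12) and (13)."""
    f = p * (p + 1) // 2 + p
    rho = 1.0 - (2 * p ** 2 + 9 * p + 11) / (6.0 * N * (p + 3))
    gamma2 = p * (2 * p ** 4 + 18 * p ** 3 + 49 * p ** 2 + 36 * p - 13) / (288.0 * (p + 3))
    return f, rho, gamma2


def approx_cdf(z, p, N):
    """Right-hand side of (11) without the O(N^-3) remainder."""
    f, rho, gamma2 = joint_test_constants(p, N)
    base = chi2_cdf(z, f)
    return base + gamma2 / (rho ** 2 * N ** 2) * (chi2_cdf(z, f + 4) - base)


if __name__ == "__main__":
    print(joint_test_constants(2, 20))
    print(approx_cdf(11.0705, p=2, N=20))
```

In use, $z$ would be the observed value of $-2\rho\log\lambda$, and one minus the returned value is an approximate significance probability.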
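Now let us return to the observations $y_1, \ldots, y_N$. Then

(14) $$\begin{aligned}
\sum_{\alpha=1}^N x_\alpha'x_\alpha &= \sum_{\alpha=1}^N (y_\alpha - \nu_0)'C'C(y_\alpha - \nu_0)\\
&= \sum_{\alpha=1}^N (y_\alpha - \nu_0)'\Psi_0^{-1}(y_\alpha - \nu_0)\\
&= \operatorname{tr}A + N\bar x'\bar x\\
&= \operatorname{tr}(B\Psi_0^{-1}) + N(\bar y - \nu_0)'\Psi_0^{-1}(\bar y - \nu_0)
\end{aligned}$$

and

(15) $$|A| = |B\Psi_0^{-1}|.$$

Theorem 10.9.1. Given the p-component observation vectors $y_1, \ldots, y_N$ from
$N(\nu, \Psi)$, the likelihood ratio criterion for testing the hypothesis $H: \nu = \nu_0$,
$\Psi = \Psi_0$, is

(16) $$\lambda = \left(\frac{e}{N}\right)^{\frac12 pN}|B\Psi_0^{-1}|^{\frac12 N}e^{-\frac12[\operatorname{tr}B\Psi_0^{-1} + N(\bar y - \nu_0)'\Psi_0^{-1}(\bar y - \nu_0)]}.$$

When the null hypothesis is true, $-2\log\lambda$ is asymptotically distributed as $\chi^2$
with $\frac12 p(p + 1) + p$ degrees of freedom.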
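For $p = 2$ the criterion (16) can be evaluated with explicit formulas for the $2 \times 2$ determinant and inverse. The following is a minimal sketch under that restriction; the function name, the argument layout, and the data in the usage lines are assumptions made for illustration.

```python
import math


def neg2_log_lambda_p2(ys, nu0, psi0):
    """-2 log(lambda) of (16) for p = 2.

    ys   : list of observations (y1, y2)
    nu0  : hypothesized mean (nu1, nu2)
    psi0 : hypothesized covariance as (psi11, psi12, psi22)
    """
    N, p = len(ys), 2
    psi11, psi12, psi22 = psi0
    det_psi = psi11 * psi22 - psi12 ** 2
    ybar = (sum(y[0] for y in ys) / N, sum(y[1] for y in ys) / N)
    # Elements of B, the matrix of sums of squares and cross products
    b11 = sum((y[0] - ybar[0]) ** 2 for y in ys)
    b22 = sum((y[1] - ybar[1]) ** 2 for y in ys)
    b12 = sum((y[0] - ybar[0]) * (y[1] - ybar[1]) for y in ys)
    det_term = (b11 * b22 - b12 ** 2) / det_psi                       # |B Psi0^{-1}|
    trace_term = (b11 * psi22 - 2.0 * b12 * psi12 + b22 * psi11) / det_psi
    d1, d2 = ybar[0] - nu0[0], ybar[1] - nu0[1]
    maha = (d1 * d1 * psi22 - 2.0 * d1 * d2 * psi12 + d2 * d2 * psi11) / det_psi
    return -p * N * (1.0 - math.log(N)) - N * math.log(det_term) + trace_term + N * maha


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
    print(neg2_log_lambda_p2(data, nu0=(0.0, 0.0), psi0=(1.0, 0.0, 1.0)))
```

The value would be compared with a $\chi^2$ distribution with $\frac12 p(p+1) + p = 5$ degrees of freedom, although with only four observations the asymptotic approximation should not be trusted.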


10.10. ADMISSIBILITY OF TESTS 


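We shall consider some Bayes solutions to the problem of testing the
hypothesis

(1) $$\Sigma_1 = \cdots = \Sigma_q$$

as in Section 10.2. Under the alternative hypothesis, let

(2) $$\left[\mu^{(g)}, \Sigma_g\right] = \left[(I + C_gC_g')^{-1}C_gy^{(g)},\ (I + C_gC_g')^{-1}\right],\qquad g = 1, \ldots, q,$$

where the $p \times r_g$ matrix $C_g$ has density proportional to $|I + C_gC_g'|^{-\frac12 n_g}$,
$n_g = N_g - 1$, the $r_g$-component random vector $y^{(g)}$ has the conditional normal
distribution with mean 0 and covariance matrix $(1/N_g)[I_{r_g} - C_g'(I_p + C_gC_g')^{-1}C_g]^{-1}$
given $C_g$, and $(C_1, y^{(1)}), \ldots, (C_q, y^{(q)})$ are independently distributed.
As we shall see, we need to choose suitable integers $r_1, \ldots, r_q$. Note
that the integral of $|I + C_gC_g'|^{-\frac12 n_g}$ is finite if $n_g \ge p + r_g$. Then the numerator
of the Bayes ratio is

(3) $$\begin{aligned}
&\text{const}\prod_{g=1}^q\int\!\!\int |I + C_gC_g'|^{\frac12 N_g}\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N_g}\big[x_\alpha^{(g)} - (I + C_gC_g')^{-1}C_gy^{(g)}\big]'(I + C_gC_g')\big[x_\alpha^{(g)} - (I + C_gC_g')^{-1}C_gy^{(g)}\big]\Big\}\\
&\qquad\cdot\,|I + C_gC_g'|^{-\frac12 n_g}\,\big|I - C_g'(I + C_gC_g')^{-1}C_g\big|^{\frac12}\exp\Big\{-\tfrac12 N_gy^{(g)\prime}\big[I - C_g'(I + C_gC_g')^{-1}C_g\big]y^{(g)}\Big\}\,dy^{(g)}\,dC_g\\
&= \text{const}\prod_{g=1}^q\int\!\!\int\exp\Big\{-\tfrac12\Big[\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}(I + C_gC_g')x_\alpha^{(g)} - 2y^{(g)\prime}C_g'\sum_{\alpha=1}^{N_g}x_\alpha^{(g)} + N_gy^{(g)\prime}y^{(g)}\Big]\Big\}\,dy^{(g)}\,dC_g\\
&= \text{const}\prod_{g=1}^q\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}x_\alpha^{(g)}\Big\}\int\!\!\int\exp\Big\{-\tfrac12 N_g\big(y^{(g)} - C_g'\bar x^{(g)}\big)'\big(y^{(g)} - C_g'\bar x^{(g)}\big) - \tfrac12\operatorname{tr}C_g'A_gC_g\Big\}\,dy^{(g)}\,dC_g\\
&= \text{const}\prod_{g=1}^q\exp\Big\{-\tfrac12\big[\operatorname{tr}A_g + N_g\bar x^{(g)\prime}\bar x^{(g)}\big]\Big\}\,|A_g|^{-\frac12 r_g}.
\end{aligned}$$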

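Under the null hypothesis let

(4) $$\left[\mu^{(g)}, \Sigma_g\right] = \left[(I + CC')^{-1}Cy^{(g)},\ (I + CC')^{-1}\right],\qquad g = 1, \ldots, q,$$

where the $p \times r$ matrix $C$ has density proportional to $|I + CC'|^{-\frac12 n}$, $n =
\sum_{g=1}^q n_g$, the $r$-component vector $y^{(g)}$ has the conditional normal distribution
with mean 0 and covariance matrix $(1/N_g)[I_r - C'(I_p + CC')^{-1}C]^{-1}$ given $C$,
and $y^{(1)}, \ldots, y^{(q)}$ are conditionally independent. Note that the integral of
$|I + CC'|^{-\frac12 n}$ is finite if $n \ge p + r$. The denominator of the Bayes ratio is

(5) $$\begin{aligned}
&\text{const}\int\!\cdots\!\int\prod_{g=1}^q|I + CC'|^{\frac12 N_g}\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N_g}\big[x_\alpha^{(g)} - (I + CC')^{-1}Cy^{(g)}\big]'(I + CC')\big[x_\alpha^{(g)} - (I + CC')^{-1}Cy^{(g)}\big]\Big\}\\
&\qquad\cdot\,|I + CC'|^{-\frac12 n}\prod_{g=1}^q\big|I - C'(I + CC')^{-1}C\big|^{\frac12}\exp\Big\{-\tfrac12 N_gy^{(g)\prime}\big[I - C'(I + CC')^{-1}C\big]y^{(g)}\Big\}\,dy^{(1)}\cdots dy^{(q)}\,dC\\
&= \text{const}\int\!\cdots\!\int\prod_{g=1}^q\exp\Big\{-\tfrac12\Big[\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}(I + CC')x_\alpha^{(g)} - 2y^{(g)\prime}C'\sum_{\alpha=1}^{N_g}x_\alpha^{(g)} + N_gy^{(g)\prime}y^{(g)}\Big]\Big\}\,dy^{(1)}\cdots dy^{(q)}\,dC\\
&= \text{const}\exp\Big\{-\tfrac12\sum_{g=1}^q\sum_{\alpha=1}^{N_g}x_\alpha^{(g)\prime}x_\alpha^{(g)}\Big\}\int\!\cdots\!\int\exp\Big\{-\tfrac12\sum_{g=1}^q N_g\big(y^{(g)} - C'\bar x^{(g)}\big)'\big(y^{(g)} - C'\bar x^{(g)}\big) - \tfrac12\operatorname{tr}C'AC\Big\}\,dy^{(1)}\cdots dy^{(q)}\,dC\\
&= \text{const}\exp\Big\{-\tfrac12\sum_{g=1}^q\big[\operatorname{tr}A_g + N_g\bar x^{(g)\prime}\bar x^{(g)}\big]\Big\}\,|A|^{-\frac12 r}.
\end{aligned}$$

The Bayes test procedure is to reject the hypothesis if

(6) $$\frac{|A|^{\frac12 r}}{\prod_{g=1}^q|A_g|^{\frac12 r_g}} \ge c.$$

For invariance we want $\sum_{g=1}^q r_g = r$.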

The binding constraint on the choice of $r_1, \ldots, r_q$ is $r_g \le n_g - p$, $g =
1, \ldots, q$. It is possible in some special cases to choose $r_1, \ldots, r_q$ so that
$(r_1, \ldots, r_q)$ is proportional to $(N_1, \ldots, N_q)$ and hence yield the likelihood ratio
test or proportional to $(n_1, \ldots, n_q)$ and hence yield the modified likelihood
ratio test, but since $r_1, \ldots, r_q$ have to be integers, it may not be possible to
choose them in either such way. Next we consider an extension of this
approach that involves the choice of numbers $t_1, \ldots, t_q$, and $t$ as well as
$r_1, \ldots, r_q$, and $r$.

Suppose $2(p - 1) < n_g$, $g = 1, \ldots, q$, and take $r_g \ge p$. Let $t_g$ be a real
number such that $2p - 1 < r_g + t_g + p < n_g + 1$, and let $t$ be a real number
such that $2p - 1 < r + t + p < n + 1$. Under the alternative hypothesis let the
marginal density of $C_g$ be proportional to $|C_gC_g'|^{\frac12 t_g}|I + C_gC_g'|^{-\frac12 n_g}$, $g =
1, \ldots, q$, and under the null hypothesis let the marginal density of $C$ be
proportional to $|CC'|^{\frac12 t}|I + CC'|^{-\frac12 n}$. (The conditions on $t_1, \ldots, t_q$, and $t$
ensure that the purported densities have finite integrals; see Problem 10.18.)
Then the Bayes procedure is to reject the null hypothesis if

(7) $$\frac{|A|^{\frac12(r + t)}}{\prod_{g=1}^q|A_g|^{\frac12(r_g + t_g)}} \ge c.$$

For invariance we want $t = \sum_{g=1}^q t_g$. If $t_1, \ldots, t_q$ are taken so $r_g + t_g = kN_g$ and
$p - 1 < kN_g < N_g - p$, $g = 1, \ldots, q$, for some $k$, then (7) is the likelihood ratio
test; if $r_g + t_g = kn_g$ and $p - 1 < kn_g < n_g + 1 - p$, $g = 1, \ldots, q$, for some $k$,
then (7) is the modified test [i.e., $(p - 1)/\min_g N_g < k < 1 - p/\min_g N_g$].


Theorem 10.10.1. If $2p < N_g + 1$, $g = 1, \ldots, q$, then the likelihood ratio
test and the modified likelihood ratio test of the null hypothesis (1) are admissible.


Now consider the hypothesis

(8) $$\mu^{(1)} = \cdots = \mu^{(q)},\qquad \Sigma_1 = \cdots = \Sigma_q.$$

The alternative hypothesis has been treated before. For the null hypothesis
let

(9) $$\left[\mu^{(g)}, \Sigma_g\right] = \left[(I + CC')^{-1}Cy,\ (I + CC')^{-1}\right],$$

where the $p \times r$ matrix $C$ has the density proportional to $|I + CC'|^{-\frac12(N - 1)}$
and the $r$-component vector $y$ has the conditional normal distribution with
mean 0 and covariance matrix $(1/N)[I - C'(I + CC')^{-1}C]^{-1}$ given $C$. Then
the Bayes procedure is to reject the null hypothesis (8) if

(10) $$\frac{\left|\sum_{g=1}^q A_g + \sum_{g=1}^q N_g(\bar x^{(g)} - \bar x)(\bar x^{(g)} - \bar x)'\right|^{\frac12 r}}{\prod_{g=1}^q|A_g|^{\frac12 r_g}} \ge c.$$

If $2p < N_g + 1$, $g = 1, \ldots, q$, the prior distribution can be modified as before
to obtain the likelihood ratio test and modified likelihood ratio test.


Theorem 10.10.2. If $2p < N_g + 1$, $g = 1, \ldots, q$, the likelihood ratio test
and modified likelihood ratio test of the null hypothesis (8) are admissible.


For more details see Kiefer and Schwartz (1965).


10.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 

10.11.1. Observations Elliptically Contoured 

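Let $x_\alpha^{(g)}$, $\alpha = 1, \ldots, N_g$, be $N_g$ observations on $X^{(g)}$ having the density

(1) $$|\Lambda_g|^{-\frac12}\,g\left[(x - \nu^{(g)})'\Lambda_g^{-1}(x - \nu^{(g)})\right],$$

where $\mathcal{E}[(X - \nu^{(g)})'\Lambda_g^{-1}(X - \nu^{(g)})]^2 = \mathcal{E}R^4 < \infty$, $g = 1, \ldots, q$. Note that the
same function $g(\cdot)$ is used for the density in all $q$ populations. Define $N$, $A_g$,
$g = 1, \ldots, q$, and $A$ by (1) of Section 10.2. Let $S_g = (1/n_g)A_g$, where $n_g =
N_g - 1$, and $S = (1/n)A$, where $n = \sum_{g=1}^q n_g$.

Since the likelihood ratio criterion $\lambda_1$ is invariant under the transformation
$X^{*(g)} = CX^{(g)} + \nu^{(g)}$, under the null hypothesis we can take $\Sigma_1 = \cdots = \Sigma_q
= I$ and $\nu^{(1)} = \cdots = \nu^{(q)} = 0$. Then

(2) $$\begin{aligned}
-2\log\lambda_1 &= -\Big[\sum_{g=1}^q N_g\log|\hat\Sigma_g| - N\log|\hat\Sigma|\Big]\\
&= -\Big[\sum_{g=1}^q N_g\log\big|I + (\hat\Sigma_g - I)\big| - N\log\big|I + (\hat\Sigma - I)\big|\Big].
\end{aligned}$$

By Theorem 3.6.2

(3) $$\sqrt{N_g}\operatorname{vec}(S_g - I_p) \xrightarrow{d} N\left[0,\ (\kappa + 1)(I_{p^2} + K_{pp}) + \kappa\operatorname{vec}I_p(\operatorname{vec}I_p)'\right],$$

and $n_gS_g = N_g\hat\Sigma_g$, $g = 1, \ldots, q$, are independent. Let $N_g = k_gN$, $g = 1, \ldots, q$,
$\sum_{g=1}^q k_g = 1$, and let $N \to \infty$. In terms of this asymptotic theory the limiting
distribution of $\operatorname{vec}(S_1 - I), \ldots, \operatorname{vec}(S_q - I)$ is the same as the distribution of
$y^{(1)}, \ldots, y^{(q)}$ of Section 8.8, with $\Sigma$ of Section 8.8 replaced by $(\kappa + 1)(I_{p^2} +
K_{pp}) + \kappa\operatorname{vec}I_p(\operatorname{vec}I_p)'$.

When $\Sigma = I$, the variance of the limiting distribution of $\sqrt{N_g}(s_{ii}^{(g)} - 1)$ is
$3\kappa + 2$; the covariance of the limiting distribution of $\sqrt{N_g}(s_{ii}^{(g)} - 1)$ and
$\sqrt{N_g}(s_{jj}^{(g)} - 1)$, $i \ne j$, is $\kappa$; the variance of $\sqrt{N_g}\,s_{ij}^{(g)}$, $i \ne j$, is $\kappa + 1$; the set
$\sqrt{N_g}(s_{11}^{(g)} - 1), \ldots, \sqrt{N_g}(s_{pp}^{(g)} - 1)$ is independent of the set $(s_{ij}^{(g)})$, $i \ne j$; and
the $s_{ij}^{(g)}$, $i < j$, are mutually uncorrelated (as in Section 7.9.1).

Let $\bar y_g = \operatorname{vec}(\hat\Sigma_g - I)$ and $\bar y = \operatorname{vec}(\hat\Sigma - I)$. Then $\bar y = \sum_{g=1}^q (N_g/N)\bar y_g$ and

(4) $$\begin{aligned}
-2\log\lambda_1 &= \tfrac12\sum_{g=1}^q N_g(\bar y_g - \bar y)'(\bar y_g - \bar y) + O_p(N^{-\frac12})\\
&= \tfrac12\Big[\sum_{g=1}^q N_g\bar y_g'\bar y_g - N\bar y'\bar y\Big] + O_p(N^{-\frac12}).
\end{aligned}$$

Let $Q$ be a $q \times q$ orthogonal matrix with last column $(\sqrt{N_1/N}, \ldots, \sqrt{N_q/N})'$.
Define

(5) $$(w_1, \ldots, w_q) = \left(\sqrt{N_1}\,\bar y_1, \ldots, \sqrt{N_q}\,\bar y_q\right)Q.$$

Then $w_q = \sqrt{N}\,\bar y$ and

(6) $$\sum_{g=1}^q N_g\bar y_g\bar y_g' - N\bar y\bar y' = \sum_{g=1}^{q-1}w_gw_g'.$$

In these terms

(7) $$-2\log\lambda_1 = \tfrac12\sum_{g=1}^{q-1}w_g'w_g + O_p(N^{-\frac12}),$$

and $w_1, \ldots, w_{q-1}$ are asymptotically independent, $w_g$ having the covariance
matrix of $\sqrt{N_g}\,\bar y_g$; that is, $(\kappa + 1)(I_{p^2} + K_{pp}) + \kappa\operatorname{vec}I_p(\operatorname{vec}I_p)'$. Then $w_g'w_g =
\sum_{i,j=1}^p (w_{ij}^{(g)})^2 = \sum_{i=1}^p (w_{ii}^{(g)})^2 + 2\sum_{i<j}(w_{ij}^{(g)})^2$. The covariance matrix of
$w_{11}^{(g)}, \ldots, w_{pp}^{(g)}$ is $2(\kappa + 1)I_p + \kappa\varepsilon\varepsilon'$, where $\varepsilon = (1, \ldots, 1)'$. The characteristic
roots of this matrix are $2(\kappa + 1)$ of multiplicity $p - 1$ and a single root of
$2(\kappa + 1) + p\kappa$. Thus $\sum_{i=1}^p (w_{ii}^{(g)})^2$ has the distribution of $2(\kappa + 1)\chi_{p-1}^2 +
[2(\kappa + 1) + p\kappa]\chi_1^2$. The distribution of $2\sum_{i<j}(w_{ij}^{(g)})^2$ is $2(\kappa + 1)\chi_{p(p-1)/2}^2$.


Theorem 10.11.1. When sampling from (1) and the null hypothesis is true,

(8) $$-2\log\lambda_1 \xrightarrow{d} (\kappa + 1)\chi_{(q-1)(p-1)(p+2)/2}^2 + \left[(\kappa + 1) + p\kappa/2\right]\chi_{q-1}^2.$$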
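The limit in (8) is a weighted sum of two independent $\chi^2$ variates. The short sketch below tabulates the weights and degrees of freedom for given $\kappa$, $p$, and $q$, and returns the mean of the limiting distribution; the function names are illustrative and not from the text.

```python
def limiting_components(kappa, p, q):
    """Weights and degrees of freedom of the two chi-square terms in (8)."""
    df_first = (q - 1) * (p - 1) * (p + 2) // 2   # (p - 1)(p + 2) is always even
    df_second = q - 1
    weight_first = kappa + 1.0
    weight_second = (kappa + 1.0) + p * kappa / 2.0
    return (weight_first, df_first), (weight_second, df_second)


def limiting_mean(kappa, p, q):
    """Mean of the limit in (8): each chi-square has mean equal to its df."""
    (w1, d1), (w2, d2) = limiting_components(kappa, p, q)
    return w1 * d1 + w2 * d2


if __name__ == "__main__":
    print(limiting_components(0.0, p=3, q=2))
    print(limiting_mean(0.5, p=2, q=3))
```

With $\kappa = 0$ both weights equal 1 and the degrees of freedom add to $\frac12(q-1)p(p+1)$, the normal-theory value; a positive $\kappa$ inflates the limiting distribution, so referring $-2\log\lambda_1$ to the normal-theory $\chi^2$ would reject too often.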
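When $\kappa = 0$, $-2\log\lambda_1 \xrightarrow{d} \chi_{(q-1)p(p+1)/2}^2$, which is in agreement with (12) of
Section 10.5. The validity of the distributions derived in Section 10.4 depends
on the observations being normally distributed; Theorem 10.11.1 shows that
even the asymptotic theory depends on normality.

The likelihood criterion for testing the null hypothesis (2) of Section 10.3 is
the product $\lambda_1\lambda_2$ or $V_1V_2$. Lemma 10.4.1 states that under normality $V_1$ and
$V_2$ (or equivalently $\lambda_1$ and $\lambda_2$) are independent. In the elliptically contoured
case we want to show that $\log V_1$ and $\log V_2$ are asymptotically independent.


Lemma 10.11.1. Let $A_1 = n_1S_1$ and $A_2 = n_2S_2$ be defined by (2) of Section
10.2 with $\Sigma_1 = \Sigma_2 = I$. Then $A_1(A_1 + A_2)^{-1}$ and $A_1 + A_2$ are asymptotically
independent.


Proof. Let $(1/\sqrt{n_g})(A_g - n_gI) = W_g$, $g = 1, 2$. Then

(9) $$\sqrt{n_1 + n_2}\left[A_1(A_1 + A_2)^{-1} - \frac{n_1}{n_1 + n_2}I\right] = \frac{n_2\sqrt{n_1}}{(n_1 + n_2)^{3/2}}W_1 - \frac{n_1\sqrt{n_2}}{(n_1 + n_2)^{3/2}}W_2 + o_p(1),$$

(10) $$\frac{1}{\sqrt{n_1 + n_2}}\left[A_1 + A_2 - (n_1 + n_2)I\right] = \sqrt{\frac{n_1}{n_1 + n_2}}\,W_1 + \sqrt{\frac{n_2}{n_1 + n_2}}\,W_2.$$

Since $W_1$ and $W_2$ are independent and have the same limiting covariance
matrix, in the limiting distribution

(11) $$\mathcal{E}\operatorname{vec}\left[\frac{n_2\sqrt{n_1}}{(n_1 + n_2)^{3/2}}W_1 - \frac{n_1\sqrt{n_2}}{(n_1 + n_2)^{3/2}}W_2\right]\left\{\operatorname{vec}\left[\sqrt{\frac{n_1}{n_1 + n_2}}\,W_1 + \sqrt{\frac{n_2}{n_1 + n_2}}\,W_2\right]\right\}' = 0.$$

By application of Lemma 10.11.1 in succession to $A_1$ and $A_1 + A_2$, to
$A_1 + A_2$ and $A_1 + A_2 + A_3$, etc., we establish that $A_1A^{-1}, A_2A^{-1}, \ldots, A_qA^{-1}$
are asymptotically independent of $A = A_1 + \cdots + A_q$. It follows that $V_1$ and $V_2$ are
asymptotically independent.


Theorem 10.11.2. When $\Sigma_1 = \cdots = \Sigma_q$ and $\mu^{(1)} = \cdots = \mu^{(q)}$,

(12) $$-2\log\lambda_1\lambda_2 = -2\log\lambda_1 - 2\log\lambda_2 \xrightarrow{d} (\kappa + 1)\chi_{(q-1)(p-1)(p+2)/2}^2 + \left[(\kappa + 1) + p\kappa/2\right]\chi_{q-1}^2 + \chi_{p(q-1)}^2.$$

The hypothesis of sphericity is that $\Sigma = \sigma^2I$ (or $\Lambda = \lambda I$). The criterion is
$\lambda_1\lambda_2$, where

(13) $$\lambda_1 = \frac{|A|^{\frac12 N}}{\prod_{i=1}^p a_{ii}^{\frac12 N}},\qquad \lambda_2 = \frac{\prod_{i=1}^p a_{ii}^{\frac12 N}}{(\operatorname{tr}A/p)^{\frac12 pN}}.$$

The first factor is the criterion for independence of the components of $X$, and
the second is that the variances of the components are equal. For the first we
set $q = p$ and $p_i = 1$ in Theorem 9.10, and for the second we set $q = p$ and
$p = 1$. Thus

(14) $$-2\log(\lambda_1\lambda_2) \xrightarrow{d} (1 + \kappa)\chi_{p(p-1)/2}^2 + \tfrac12(3\kappa + 2)\chi_{p-1}^2.$$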

10.11.2. Elliptically Contoured Matrix Distributions 


Consider the density

(15) $$\begin{aligned}
&\prod_{g=1}^q|\Lambda_g|^{-\frac12 N_g}\,g\Big[\sum_{g=1}^q\operatorname{tr}\Lambda_g^{-1}\big(X^{(g)} - \nu^{(g)}\varepsilon_{N_g}'\big)\big(X^{(g)} - \nu^{(g)}\varepsilon_{N_g}'\big)'\Big]\\
&\quad= \prod_{g=1}^q|\Lambda_g|^{-\frac12 N_g}\,g\Big\{\sum_{g=1}^q\Big[\operatorname{tr}\Lambda_g^{-1}A_g + N_g\big(\bar x^{(g)} - \nu^{(g)}\big)'\Lambda_g^{-1}\big(\bar x^{(g)} - \nu^{(g)}\big)\Big]\Big\}.
\end{aligned}$$

In this density $(A_g, \bar x^{(g)})$, $g = 1, \ldots, q$, is a sufficient set of statistics, and the
likelihood ratio criterion is (8) of Section 10.2, the same as for normality
[Anderson and Fang (1990b)].


Theorem 10.11.3. Let $f(X)$ be a vector-valued function of $X =
(X^{(1)}, \ldots, X^{(q)})$ $(p \times N)$ such that

(16) $$f\big(X^{(1)} + \nu^{(1)}\varepsilon_{N_1}', \ldots, X^{(q)} + \nu^{(q)}\varepsilon_{N_q}'\big) = f\big(X^{(1)}, \ldots, X^{(q)}\big)$$

for every $(\nu^{(1)}, \ldots, \nu^{(q)})$ and

(17) $$f\big(CX^{(1)}, \ldots, CX^{(q)}\big) = f\big(X^{(1)}, \ldots, X^{(q)}\big)$$

for every nonsingular $C$. Then the distribution of $f(X)$ where $X$ has the arbitrary
density (15) with $\Lambda_1 = \cdots = \Lambda_q$ is the same as the distribution of $f(X)$ where $X$
has the normal density (15).

The proof of Theorem 10.11.3 is similar to the proof of Theorem 4.5.4.
The theorem implies that the distribution of the criterion $V_1$ of (10) of
Section 10.2 when the density of $X$ is (15) with $\Lambda_1 = \cdots = \Lambda_q$ is the same as
for normality. Hence the distributions and their asymptotic expansions are
those discussed in Sections 10.4 and 10.5.


Corollary 10.11.1. Let $f(X)$ be a vector-valued function of $X$ $(p \times N)$ such
that

(18) $$f(X + \nu\varepsilon_N') = f(X)$$

for every $\nu$ and (17) holds. Then the distribution of $f(X)$, where $X$ has the
arbitrary density (15) with $\Lambda_1 = \cdots = \Lambda_q$ and $\nu^{(1)} = \cdots = \nu^{(q)}$, is the same as
the distribution of $f(X)$, where $X$ has the normal density (15).


It follows that the distribution of the criterion $\lambda$ of (7) or $V$ of (11) of
Section 10.3 is the same for the density (15) as for $X$ being normally
distributed.

Let $X$ $(p \times N)$ have the density

(19) $$|\Lambda|^{-\frac12 N}\,g\left[\operatorname{tr}\Lambda^{-1}(X - \nu\varepsilon_N')(X - \nu\varepsilon_N')'\right].$$

Then the likelihood ratio criterion for testing the null hypothesis $\Lambda = \lambda I$ for
some $\lambda > 0$ is (7) of Section 10.7, and its distribution under the null
hypothesis is the same as for $X$ being normally distributed.

For more detail see Anderson and Fang (1990b) and Fang and Zhang
(1990).


PROBLEMS 


10.1. (Sec. 10.2) Sums of squares and cross-products of deviations from the means
of four measurements are given below (from Table 3.4). The populations are
Iris versicolor (1), Iris setosa (2), and Iris virginica (3); each sample consists of 50
observations:

$$A_1 = \begin{pmatrix}13.0552 & 4.1740 & 8.9620 & 2.7332\\ 4.1740 & 4.8250 & 4.0500 & 2.0190\\ 8.9620 & 4.0500 & 10.8200 & 3.5820\\ 2.7332 & 2.0190 & 3.5820 & 1.9162\end{pmatrix},$$

$$A_2 = \begin{pmatrix}6.0882 & 4.8616 & 0.8014 & 0.5062\\ 4.8616 & 7.0408 & 0.5732 & 0.4556\\ 0.8014 & 0.5732 & 1.4778 & 0.2974\\ 0.5062 & 0.4556 & 0.2974 & 0.5442\end{pmatrix},$$

$$A_3 = \begin{pmatrix}19.8128 & 4.5944 & 14.8612 & 2.4056\\ 4.5944 & 5.0962 & 3.4976 & 2.3338\\ 14.8612 & 3.4976 & 14.9248 & 2.3924\\ 2.4056 & 2.3338 & 2.3924 & 3.6962\end{pmatrix}.$$

(a) Test the hypothesis $\Sigma_1 = \Sigma_2$ at the 5% significance level.
(b) Test the hypothesis $\Sigma_1 = \Sigma_2 = \Sigma_3$ at the 5% significance level.


10.2. (Sec. 10.2)

(a) Let $Y^{(g)}$, $g = 1, \ldots, q$, be a set of random vectors each with $p$ components.
Suppose

$$\mathcal{E}Y^{(g)} = 0,\qquad \mathcal{E}Y^{(g)}Y^{(h)\prime} = \delta_{gh}\Sigma_g.$$

Let $C$ be an orthogonal matrix of order $q$ such that each element of the
last row is

$$c_{qh} = 1/\sqrt{q}.$$

Define

$$Z^{(g)} = \sum_{h=1}^q c_{gh}Y^{(h)},\qquad g = 1, \ldots, q.$$

Show that

$$\mathcal{E}Z^{(g)}Z^{(q)\prime} = 0,\qquad g = 1, \ldots, q - 1,$$

if and only if

$$\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q.$$

(b) Let $X_\alpha^{(g)}$, $\alpha = 1, \ldots, N$, be a random sample from $N(\mu^{(g)}, \Sigma_g)$, $g = 1, \ldots, q$.
Use the result from (a) to construct a test of the hypothesis

$$H: \Sigma_1 = \cdots = \Sigma_q,$$

based on a test of independence of $Z^{(q)}$ and the set $Z^{(1)}, \ldots, Z^{(q-1)}$. Find
the exact distribution of the criterion for the case $p = 2$.

10.3. (Sec. 10.2) Unbiasedness of the modified likelihood ratio test of a? = a2. Show 
that (14) is unbiased. [ Hint: Let С =n, F/n,, г= a?/27, and c, <c, be the 


solutions to G?! (14 G) #14") к, the critical val ifi 
olut | , ue for th 
likelihood ratio criterion. Then e modified 


Pr(Acceptancel o/o? = r} = const сн 1 {1 + TG) gu n2) dG 
Mi 


= const f PHI + Hy) iamen) АН. 
1 


Show that the derivative of the above with respect to $r$ is positive for $0 < r < 1$,
0 for $r = 1$, and negative for $r > 1$.]

10.4. (Sec. 10.2) Prove that the limiting distribution of (19) is $\chi_f^2$, where $f =
\frac12 p(p + 1)(q - 1)$. [Hint: Let $\Sigma = I$. Show that the limiting distribution of
(19) is the limiting distribution of

$$\tfrac12\sum_{i=1}^p\sum_{g=1}^q n_g\big(s_{ii}^{(g)} - s_{ii}\big)^2 + \sum_{i<j}\sum_{g=1}^q n_g\big(s_{ij}^{(g)} - s_{ij}\big)^2,$$

where $S_g = (s_{ij}^{(g)})$, $S = (s_{ij})$, and the $\sqrt{n_g}(s_{ij}^{(g)} - \delta_{ij})$, $i \le j$, are independent in
the limiting distribution, the limiting distribution of $\sqrt{n_g}(s_{ii}^{(g)} - 1)$ is $N(0, 2)$,
and the limiting distribution of $\sqrt{n_g}\,s_{ij}^{(g)}$, $i < j$, is $N(0, 1)$.]

10.5. (Sec. 10.4) Prove (15) by integration of Wishart densities. [Hint: ФУ! = 
ёп lAl ":|4| 3" can be written as the integral of a constant times 


|A|~ VIT wCA,Z, n, + hn,). Integration over £7_,A, =A gives a constant 
times w(A|%, п).] 


10.6. (Sec. 10.4) Prove (16) by integration of Wishart and normal densities. [ Hint: 
Y3., NGC? — xXx? — x. is distributed as 7421 ypy; Use the hint of Prob- 
lem 10.5.] 


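10.7. (Sec. 10.6) Let $x_1^{(\nu)}, \ldots, x_N^{(\nu)}$ be observations from $N(\mu^{(\nu)}, \Sigma_\nu)$, $\nu = 1, 2$, and
let $A_\nu = \sum_{\alpha=1}^N (x_\alpha^{(\nu)} - \bar x^{(\nu)})(x_\alpha^{(\nu)} - \bar x^{(\nu)})'$.

(a) Prove that the likelihood ratio test for $H: \Sigma_1 = \Sigma_2$ is equivalent to
rejecting $H$ if

$$T = \frac{|A_1| \cdot |A_2|}{|A_1 + A_2|^2} \le c.$$

(b) Let $d_1^2, d_2^2, \ldots, d_p^2$ be the roots of $|\Sigma_1 - \lambda\Sigma_2| = 0$, and let

$$D = \begin{pmatrix}d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & d_p\end{pmatrix}.$$

Show that $T$ is distributed as $|B_1| \cdot |B_2|/|B_1 + B_2|^2$, where $B_1$ is distributed
according to $W(D^2, N - 1)$ and $B_2$ is distributed according to
$W(I, N - 1)$. Show that $T$ is distributed as $|DC_1D| \cdot |C_2|/|DC_1D + C_2|^2$,
where $C_i$ is distributed according to $W(I, N - 1)$.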


10.8. (Sec. 10.6) For p = 2 show 
РКИ, xv) = (п, – 1,5 1) 
-1 m m (14 n5-2)/n b -2n fn -nn 
t B^ (n, = 1, n5 = 1)0 [ 3770-21) "^ d 


а 


$$+\ 1 - I_b(n_1 - 1,\ n_2 - 1),$$

where $a < b$ are the two roots of $x^{n_1}(1 - x)^{n_2} = v\,n_1^{n_1}n_2^{n_2}/n^n$. [Hint: This
follows from integrating the density defined by (8).]


10.9. (Sec. 10.6) For $p = 2$ and $n_1 = n_2 = m$, say, show
РКИ, x v) 


PER 


=21(т-1т- 1) € 2B7 (m - tm — тии" log M. 
Í 1- Vr - 407" 


where а = [1 — V1 — 40/7]. 


10.10. (Sec. 10.7) Find the distribution of $W$ for $p = 2$ under the null hypothesis (a)
directly from the distribution of $A$ and (b) from the distribution of the
characteristic roots (Chapter 13).


10.11. (Sec. 10.7) Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$. What is the likelihood
ratio criterion for testing the hypothesis $\mu = k\mu_0$, $\Sigma = k^2\Sigma_0$, where $\mu_0$ and $\Sigma_0$
are specified and $k$ is unspecified?


10.12. (Sec. 10.7) Let $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ be a sample from $N(\mu^{(1)}, \Sigma_1)$, and $x_1^{(2)}, \ldots, x_{N_2}^{(2)}$
be a sample from $N(\mu^{(2)}, \Sigma_2)$. What is the likelihood ratio criterion for testing
the hypothesis that $\Sigma_1 = k^2\Sigma_2$, where $k$ is unspecified? What is the likelihood
ratio criterion for testing the hypothesis that $\mu^{(1)} = k\mu^{(2)}$ and $\Sigma_1 = k^2\Sigma_2$,
where $k$ is unspecified?


10.13. (Sec. 10.7) Let $x_\alpha$ of $p$ components, $\alpha = 1, \ldots, N$, be observations from
$N(\mu, \Sigma)$. We define the following hypotheses:

$$H: \mu = 0,\quad \Sigma = k^2\Sigma_0,$$
$$H_1: \Sigma = k^2\Sigma_0,$$
$$H_2: \mu = 0,\ \text{given that } \Sigma = k^2\Sigma_0.$$

In each case $k^2$ is unspecified, but $\Sigma_0$ is specified. Find the likelihood ratio
criterion $\lambda_2$ for testing $H_2$. Give the asymptotic distribution of $-2\log\lambda_2$
under $H_2$. Obtain the exact distribution of a suitable monotonic function of $\lambda_2$
under $H_2$.


10.14. (Sec. 10.7) Find the likelihood ratio criterion $\lambda$ for testing $H$ of Problem
10.13 (given $x_1, \ldots, x_N$). What is the asymptotic distribution of $-2\log\lambda$ under
$H$?


10.15. (Sec. 10.7) Show that $\lambda = \lambda_1\lambda_2$, where $\lambda$ is defined in Problem 10.14, $\lambda_2$ is
defined in Problem 10.13, and $\lambda_1$ is the likelihood ratio criterion for $H_1$ in
Problem 10.13. Are $\lambda_1$ and $\lambda_2$ independently distributed under $H$? Prove your
answer.

10.16. (Sec. 10.7) Verify that $\operatorname{tr}B\Psi_0^{-1}$ has the $\chi^2$-distribution with $p(N - 1)$ degrees
of freedom.


10.17. (Sec. 10.7.1) Admissibility of sphericity test. Prove that the likelihood ratio test
of sphericity is admissible. [Hint: Under the null hypothesis let Х = [1/01 + 
„ZNI. and let т have the density (1 + 9) 2Р2)? 7 E] 


10.18. (Sec. 10.10.1) Show that fcr rz p 


1 
1 1 
zt r an 


1+ xx 
i=l 


r 
Па; < 2 
і=1 


і 


x х |v ; 
f -f Lxx 
-£ -æ| j1 


f2p-1<t+rtp<n+l. [Hint ПАТИ + А <1 if 4 is positive semidefi- 
a > - 

nite. Also, |Z/.,x;xj| has the distribution of xix2., 7o рж V Xie 

are independently distributed according to N(0, Г).] 


10.19. (Sec. 10.10.1) Show 


* 2 1 П M H _1 r 
f -f СС eint AC dC = const| Al 34+ Д 


where C is p Xr: [ Hint: CC’ has ће distribution W(A7',r) if C has a density 


1 . 
proportional to e7 "< 4С] 


10.20. (Sec. 10.10.1) Using Problem 10.18, complete the proof of Theorem 10.10.1. 


CHAPTER 11 


Principal Components 


11.1. INTRODUCTION 


Principal components are linear combinations of random or statistical vari- 
ables which have special properties in terms of variances. For example, the 
first principal component is the normalized linear combination (the sum of 
squares of the coefficients being one) with maximum variance. In effect, 
transforming the original vector variable to the vector of principal compo- 
nents amounts to a rotation of coordinate axes to a new coordinate system 
that has inherent statistical properties. This choosing of a coordinate system 
is to be contrasted with the many problems treated previously where the 
coordinate system is irrelevant.

The principal components turn out to be the characteristic vectors of the
covariance matrix. Thus the study of principal components can be considered
as putting into statistical terms the usual developments of characteristic roots
and vectors (for positive semidefinite matrices).

From the point of view of statistical theory, the set of principal compo- 
nents yields a convenient set of coordinates, and the accompanying variances 
of the components characterize their statistical properties. In statistical 
practice, the method of principal components is used to find the linear 
combinations with large variance. In many exploratory studies the number of 
variables under consideration is too large to handle. Since it is the deviations 
in these studies that are of interest, a way of reducing the number of 
variables to be treated is to discard the linear combinations which have small 
variances and study only those with large variances. For example, a physical
anthropologist may make dozens of measurements of lengths and breadths of
each of a number of individuals, such measurements as ear length, ear
breadth, facial length, facial breadth, and so forth. He may be interested in
describing and analyzing how individuals differ in these kinds of physiological
characteristics. Eventually he will want to explain these differences, but first
he wants to know what measurements or combinations of measurements
show considerable variation; that is, which should have further study. The
principal components give a new set of linearly combined measurements. It
may be that most of the variation from individual to individual resides
in three linear combinations; then the anthropologist can direct his study to
these three quantities; the other linear combinations vary so little from one
person to the next that study of them will tell little of individual variation.

Hotelling (1933), who developed many of these ideas, gave a rather 
thorough discussion. 

In Section 11.2 we define principal components in the population to have
the properties described above; they define an orthogonal transformation to
a diagonal covariance matrix. The maximum likelihood estimators have
similar properties in the sample (Section 11.3). A brief discussion of computation
is given in Section 11.4, and a numerical example is carried out in
Section 11.5. Asymptotic distributions of the coefficients of the sample
principal components and the sample variances are derived and applied to
obtain large-sample tests and confidence intervals for individual parameters
(Section 11.6); exact confidence bounds are found for the characteristic roots
of a covariance matrix. In Section 11.7 we consider other tests of hypotheses
about these roots.


11.2. DEFINITION OF PRINCIPAL COMPONENTS 
IN THE POPULATION 


Suppose the random vector X of p components has the covariance matrix X. 
Since we shall be interested only in variances and covariances in this chapter, 
we shall assume that the mean vector is 0. Moreover, in developing the ideas 
and algebra here, the actual distribution of X is irrelevant except for the 
covariance matrix; however, if X is normally distributed, more meaning can 
be given to the principal components. 

In the following treatment we shall not use the usual theory of characteris- 
tic roots and vectors; as a matter of fact, that theory will be derived implicitly. 
The treatment will include the cases where X is singular (.е., positive 
semidefinite) and where X has multiple roots. 


Let $\beta$ be a $p$-component column vector such that $\beta'\beta = 1$. The variance of 
$\beta'X$ is 

(1)    $\mathscr{E}(\beta'X)^2 = \mathscr{E}\beta'XX'\beta = \beta'\Sigma\beta.$ 

To determine the normalized linear combination $\beta'X$ with maximum variance, 
we must find a vector $\beta$ satisfying $\beta'\beta = 1$ which maximizes (1). Let 

(2)    $\phi = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) = \sum_{i,j}\beta_i\sigma_{ij}\beta_j - \lambda\left(\sum_i \beta_i^2 - 1\right),$ 

where $\lambda$ is a Lagrange multiplier. The vector of partial derivatives 
$(\partial\phi/\partial\beta_i)$ is 

(3)    $\frac{\partial\phi}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta$ 

(by Theorem A.4.3 of the Appendix). Since $\beta'\Sigma\beta$ and $\beta'\beta$ have derivatives 
everywhere in a region containing $\beta'\beta = 1$, a vector $\beta$ maximizing $\beta'\Sigma\beta$ 
must satisfy the expression (3) set equal to 0; that is 

(4)    $(\Sigma - \lambda I)\beta = 0.$ 

In order to get a solution of (4) with $\beta'\beta = 1$ we must have $\Sigma - \lambda I$ singular; 
in other words, $\lambda$ must satisfy 

(5)    $|\Sigma - \lambda I| = 0.$ 


The function $|\Sigma - \lambda I|$ is a polynomial in $\lambda$ of degree $p$. Therefore (5) has $p$ 
roots; let these be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. [Taking $\beta'$ as the complex 
conjugate transpose in (6) proves $\lambda$ real.] If we multiply (4) on the left by 
$\beta'$, we obtain 

(6)    $\beta'\Sigma\beta = \lambda\beta'\beta = \lambda.$ 

This shows that if $\beta$ satisfies (4) (and $\beta'\beta = 1$), then the variance of $\beta'X$ 
[given by (1)] is $\lambda$. Thus for the maximum variance we should use in (4) the 
largest root $\lambda_1$. Let $\beta^{(1)}$ be a normalized solution of $(\Sigma - \lambda_1 I)\beta = 0$. Then 
$U_1 = \beta^{(1)\prime}X$ is a normalized linear combination with maximum variance. [If 
$\Sigma - \lambda_1 I$ is of rank $p - 1$, then there is only one solution to $(\Sigma - \lambda_1 I)\beta = 0$ 
and $\beta'\beta = 1$.] 
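The argument from (1) to (6) can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2 x 2 covariance matrix whose entries are illustrative and not taken from the text. It solves the determinantal equation (5) in closed form and scans unit vectors $\beta = (\cos t, \sin t)'$ to confirm that the variance $\beta'\Sigma\beta$ does not exceed the largest root.

```python
import math

# Hypothetical covariance matrix (illustrative values, not from the text).
SIGMA = [[2.0, 1.0],
         [1.0, 2.0]]

def variance(beta, sigma):
    """Return beta' Sigma beta, the variance of beta'X as in (1)."""
    p = len(beta)
    return sum(beta[i] * sigma[i][j] * beta[j] for i in range(p) for j in range(p))

def roots_2x2(sigma):
    """Roots of |Sigma - lambda I| = 0 for a symmetric 2 x 2 matrix, largest first."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    half_trace = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return half_trace + radius, half_trace - radius

lam1, lam2 = roots_2x2(SIGMA)

# Scan unit vectors beta = (cos t, sin t)' and keep the largest variance found.
steps = 3600
best_var, best_t = max(
    (variance((math.cos(t), math.sin(t)), SIGMA), t)
    for t in (math.pi * k / steps for k in range(steps))
)

print("roots of (5):", lam1, lam2)
print("largest variance on the scan:", best_var, "at angle", best_t)
```

The scan is only a crude check on a grid; it illustrates that the maximizing direction is a characteristic vector for the largest root, which is what (4) and (6) assert.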

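Now let us find a normalized combination $\beta'X$ that has maximum variance 
of all linear combinations uncorrelated with $U_1$. Lack of correlation 
means 

(7)    $0 = \mathscr{E}\beta'XU_1 = \mathscr{E}\beta'XX'\beta^{(1)} = \beta'\Sigma\beta^{(1)} = \lambda_1\beta'\beta^{(1)},$ 

since $\Sigma\beta^{(1)} = \lambda_1\beta^{(1)}$. Thus $\beta'X$ is orthogonal to $U_1$ in both the statistical sense 
(of lack of correlation) and the geometric sense (of the inner product of the 
vectors $\beta$ and $\beta^{(1)}$ being zero). (That is, $\lambda_1\beta'\beta^{(1)} = 0$ only if $\beta'\beta^{(1)} = 0$ when 
$\lambda_1 \ne 0$, and $\lambda_1 \ne 0$ if $\Sigma \ne 0$; the case of $\Sigma = 0$ is trivial and is not treated.) 
We now want to maximize 

(8)    $\phi_2 = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\nu_1\beta'\Sigma\beta^{(1)},$ 

where $\lambda$ and $\nu_1$ are Lagrange multipliers. The vector of partial derivatives is 

(9)    $\frac{\partial\phi_2}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\nu_1\Sigma\beta^{(1)},$ 

and we set this equal to 0. From (9) we obtain by multiplying on the left by 
$\beta^{(1)\prime}$ 

(10)    $0 = 2\beta^{(1)\prime}\Sigma\beta - 2\lambda\beta^{(1)\prime}\beta - 2\nu_1\beta^{(1)\prime}\Sigma\beta^{(1)} = -2\nu_1\lambda_1$ 

by (7). Therefore, $\nu_1 = 0$ and $\beta$ must satisfy (4), and therefore $\lambda$ must satisfy 
(5). Let $\lambda_{(2)}$ be the maximum of $\lambda_1,\ldots,\lambda_p$ such that there is a vector $\beta$ 
satisfying $(\Sigma - \lambda_{(2)}I)\beta = 0$, $\beta'\beta = 1$, and (7); call this vector $\beta^{(2)}$ and the 
corresponding linear combination $U_2 = \beta^{(2)\prime}X$. (It will be shown eventually 
that $\lambda_{(2)} = \lambda_2$. We define $\lambda_{(1)} = \lambda_1$.) 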


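This procedure is continued; at the $(r + 1)$st step, we want to find a vector 
$\beta$ such that $\beta'X$ has maximum variance of all normalized linear combinations 
which are uncorrelated with $U_1,\ldots,U_r$, that is, such that 

(11)    $0 = \mathscr{E}\beta'XU_i = \mathscr{E}\beta'XX'\beta^{(i)} = \beta'\Sigma\beta^{(i)} = \lambda_{(i)}\beta'\beta^{(i)}, \qquad i = 1,\ldots,r.$ 

We want to maximize 

(12)    $\phi_{r+1} = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\sum_{i=1}^{r}\nu_i\beta'\Sigma\beta^{(i)},$ 

where $\lambda$ and $\nu_1,\ldots,\nu_r$ are Lagrange multipliers. The vector of partial 
derivatives is 

(13)    $\frac{\partial\phi_{r+1}}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\sum_{i=1}^{r}\nu_i\Sigma\beta^{(i)},$ 

and we set this equal to 0. Multiplying (13) on the left by $\beta^{(j)\prime}$, we obtain 

(14)    $0 = 2\beta^{(j)\prime}\Sigma\beta - 2\lambda\beta^{(j)\prime}\beta - 2\nu_j\beta^{(j)\prime}\Sigma\beta^{(j)}.$ 

If $\lambda_{(j)} \ne 0$, this gives $-2\nu_j\lambda_{(j)} = 0$ and $\nu_j = 0$. If $\lambda_{(j)} = 0$, then $\Sigma\beta^{(j)} = \lambda_{(j)}\beta^{(j)} = 0$ 
and the $j$th term in the sum in (13) vanishes. Thus $\beta$ must satisfy (4), and 
therefore $\lambda$ must satisfy (5). 

Let $\lambda_{(r+1)}$ be the maximum of $\lambda_1,\ldots,\lambda_p$ such that there is a vector $\beta$ 
satisfying $(\Sigma - \lambda_{(r+1)}I)\beta = 0$, $\beta'\beta = 1$, and (11); call this vector $\beta^{(r+1)}$, and 
the corresponding linear combination $U_{r+1} = \beta^{(r+1)\prime}X$. If $\lambda_{(r+1)} = 0$ and 
$\lambda_{(j)} = 0$, $j \ne r + 1$, then $\beta^{(j)\prime}\Sigma\beta^{(r+1)} = 0$ does not imply $\beta^{(j)\prime}\beta^{(r+1)} = 0$. 
However, $\beta^{(r+1)}$ can be replaced by a linear combination of $\beta^{(r+1)}$ and the $\beta^{(j)}$'s 
with $\lambda_{(j)}$'s being 0, so that the new $\beta^{(r+1)}$ is orthogonal to all $\beta^{(j)}$, $j = 1,\ldots,r$. 
This procedure is carried on until at the $(m + 1)$st stage one cannot find a 
vector $\beta$ satisfying $\beta'\beta = 1$, (4), and (11). Either $m = p$ or $m < p$ since 
$\beta^{(1)},\ldots,\beta^{(m)}$ must be linearly independent. 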

We shall now show that the inequality $m < p$ leads to a contradiction. If 
$m < p$, there exist $p - m$ vectors, say $e_{m+1},\ldots,e_p$, such that $\beta^{(i)\prime}e_j = 0$, 
$e_i'e_j = \delta_{ij}$. (This follows from Lemma A.4.2 in the Appendix.) Let 
$(e_{m+1},\ldots,e_p) = E$. Now we shall show that there exists a $(p - m)$-component 
vector $c$ and a number $\theta$ such that $Ec = \sum c_ie_i$ is a solution to (4) with $\lambda = \theta$. 
Consider a root of $|E'\Sigma E - \theta I| = 0$ and a corresponding vector $c$ satisfying 
$E'\Sigma Ec = \theta c$. The vector $\Sigma Ec$ is orthogonal to $\beta^{(1)},\ldots,\beta^{(m)}$ (since 
$\beta^{(i)\prime}\Sigma Ec = \lambda_{(i)}\beta^{(i)\prime}\sum_j c_je_j = \lambda_{(i)}\sum_j c_j\beta^{(i)\prime}e_j = 0$) and therefore is a vector in the space 
spanned by $e_{m+1},\ldots,e_p$ and can be written as $Eg$ [where $g$ is a $(p - m)$-component 
vector]. Multiplying $\Sigma Ec = Eg$ on the left by $E'$, we obtain 
$E'\Sigma Ec = E'Eg = g$. Thus $g = \theta c$, and we have $\Sigma(Ec) = \theta(Ec)$. Then $(Ec)'X$ 
is uncorrelated with $\beta^{(j)\prime}X$, $j = 1,\ldots,m$, and thus leads to a new $\beta^{(m+1)}$. 
Since this contradicts the assumption that $m < p$, we must have $m = p$. 
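Let $B = (\beta^{(1)} \cdots \beta^{(p)})$ and 

(15)    $\Lambda = \begin{pmatrix} \lambda_{(1)} & 0 & \cdots & 0 \\ 0 & \lambda_{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{(p)} \end{pmatrix}.$ 

The equations $\Sigma\beta^{(r)} = \lambda_{(r)}\beta^{(r)}$ can be written in matrix form as 

(16)    $\Sigma B = B\Lambda,$ 

and the equations $\beta^{(r)\prime}\beta^{(r)} = 1$ and $\beta^{(r)\prime}\beta^{(s)} = 0$, $r \ne s$, can be written as 

(17)    $B'B = I.$ 

From (16) and (17) we obtain 

(18)    $B'\Sigma B = \Lambda.$ 

From the fact that 

(19)    $|\Sigma - \lambda I| = |B'| \cdot |\Sigma - \lambda I| \cdot |B|$ 
        $= |B'\Sigma B - \lambda B'B| = |\Lambda - \lambda I|$ 
        $= \prod_{i=1}^{p}(\lambda_{(i)} - \lambda)$ 

we see that the roots of (19) are the diagonal elements of $\Lambda$; that is, 
$\lambda_{(1)} = \lambda_1, \lambda_{(2)} = \lambda_2, \ldots, \lambda_{(p)} = \lambda_p$. 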


We have proved the following theorem: 


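Theorem 11.2.1. Let the $p$-component random vector $X$ have $\mathscr{E}X = 0$ and 
$\mathscr{E}XX' = \Sigma$. Then there exists an orthogonal linear transformation 

(20)    $U = B'X$ 

such that the covariance matrix of $U$ is $\mathscr{E}UU' = \Lambda$ and 

(21)    $\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix},$ 

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the roots of (5). The $r$th column of $B$, $\beta^{(r)}$, 
satisfies $(\Sigma - \lambda_rI)\beta^{(r)} = 0$. The $r$th component of $U$, $U_r = \beta^{(r)\prime}X$, has maximum 
variance of all normalized linear combinations uncorrelated with $U_1,\ldots,U_{r-1}$. 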


The vector $U$ is defined as the vector of principal components of $X$. It will 
be observed that we have proved Theorem A.2.1 of Appendix A for $B$ 
positive semidefinite, and indeed, the proof holds for any symmetric $B$. It 
might be noted that once the transformation to $U_1,\ldots,U_p$ has been made, 
it is obvious that $U_1$ is the normalized linear combination with maximum 
variance, for if $U^* = \sum c_iU_i$, where $\sum c_i^2 = 1$ ($U^*$ also being a normalized 
linear combination of the $X$'s), then $\mathrm{Var}(U^*) = \sum_{i=1}^{p} c_i^2\lambda_i = \lambda_1 + \sum_{i=2}^{p} c_i^2(\lambda_i - \lambda_1)$ 
(since $c_1^2 = 1 - \sum_{i=2}^{p} c_i^2$), which is clearly maximum for $c_i = 0$, $i = 2,\ldots,p$. 
Similarly, $U_2$ is the normalized linear combination uncorrelated with $U_1$ 
which has maximum variance ($U^* = \sum c_iU_i$ being uncorrelated with $U_1$ 
implying $c_1 = 0$); in turn the maximal properties of $U_3,\ldots,U_p$ are verified. 

Some other consequences can be derived. 


Corollary 11.2.1. Suppose $\lambda_{r+1} = \cdots = \lambda_{r+m} = \nu$ (i.e., $\nu$ is a root of 
multiplicity $m$); then $\Sigma - \nu I$ is of rank $p - m$. Furthermore $B^* = (\beta^{(r+1)} \cdots \beta^{(r+m)})$ 
is uniquely determined except for multiplication on the right by an 
orthogonal matrix. 

Proof. From the derivation of the theorem we have $(\Sigma - \nu I)\beta^{(i)} = 0$, 
$i = r + 1,\ldots,r + m$; that is, $\beta^{(r+1)},\ldots,\beta^{(r+m)}$ are $m$ linearly independent 
solutions of $(\Sigma - \nu I)\beta = 0$. To show that there cannot be another linearly 
independent solution, take $\sum_{i=1}^{p} x_i\beta^{(i)}$, where the $x_i$ are scalars. If it is a 
solution, we have $\nu\sum x_i\beta^{(i)} = \Sigma(\sum x_i\beta^{(i)}) = \sum x_i\Sigma\beta^{(i)} = \sum x_i\lambda_i\beta^{(i)}$. Since $\nu x_i = \lambda_ix_i$, 
we must have $x_i = 0$ unless $i = r + 1,\ldots,r + m$. Thus the rank is $p - m$. 

If $B^*$ is one set of solutions to $(\Sigma - \nu I)\beta = 0$, then any other set of 
solutions are linear combinations of the others, that is, are $B^*A$ for $A$ 
nonsingular. However, the orthogonality conditions $B^{*\prime}B^* = I$ applied to the 
linear combinations give $I = (B^*A)'(B^*A) = A'B^{*\prime}B^*A = A'A$, and thus $A$ 
must be orthogonal. $\blacksquare$ 


Theorem 11.2.2. An orthogonal transformation V = CX of a random vector 
X leaves invariant the generalized variance and the sum of the variances of the 
components. 


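Proof. Let $\mathscr{E}X = 0$ and $\mathscr{E}XX' = \Sigma$. Then $\mathscr{E}V = 0$ and $\mathscr{E}VV' = C\Sigma C'$. The 
generalized variance of $V$ is 

(22)    $|C\Sigma C'| = |C| \cdot |\Sigma| \cdot |C'| = |\Sigma| \cdot |CC'| = |\Sigma|,$ 

which is the generalized variance of $X$. The sum of the variances of the 
components of $V$ is 

(23)    $\sum_{i=1}^{p}\mathscr{E}V_i^2 = \mathrm{tr}(C\Sigma C') = \mathrm{tr}(\Sigma C'C) = \mathrm{tr}(\Sigma I) = \mathrm{tr}\,\Sigma = \sum_{i=1}^{p}\mathscr{E}X_i^2.$ $\blacksquare$ 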


Corollary 11.2.2. The generalized variance of the vector of principal compo- 
nents is the generalized variance of the original vector, and the sum of the 
variances of the principal components is the sum of the variances of the original 
variates. 
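The invariance in Theorem 11.2.2 is easy to verify on a small case. The sketch below assumes a hypothetical 2 x 2 covariance matrix and a plane rotation for the orthogonal matrix $C$; both are illustrative choices, not taken from the text.

```python
import math

# Hypothetical covariance matrix and rotation angle (illustrative only).
SIGMA = [[4.0, 2.0],
         [2.0, 3.0]]
ANGLE = 0.7

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# An orthogonal matrix C (a plane rotation), so that C C' = I.
C = [[math.cos(ANGLE), -math.sin(ANGLE)],
     [math.sin(ANGLE), math.cos(ANGLE)]]

# The covariance matrix of V = C X is C Sigma C'.
cov_v = matmul(matmul(C, SIGMA), transpose(C))

print("generalized variance:", det2(SIGMA), det2(cov_v))
print("sum of variances:    ", trace(SIGMA), trace(cov_v))
```

Both printed pairs should agree up to rounding error, as (22) and (23) require.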


Another approach to the above theory can be based on the surfaces of 
constant density of the normal distribution with mean vector 0 and covariance 
matrix $\Sigma$ (nonsingular). The density is 

(24)    $\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\,e^{-\frac{1}{2}x'\Sigma^{-1}x},$ 

and surfaces of constant density are ellipsoids 

(25)    $x'\Sigma^{-1}x = C.$ 

A principal axis of this ellipsoid is defined as the line from $-y$ to $y$, where $y$ 
is a point on the ellipsoid where the squared distance $x'x$ has a stationary 
point. Using the method of Lagrange multipliers, we determine the stationary 
points by considering 

(26)    $\psi = x'x - \lambda x'\Sigma^{-1}x,$ 

where $\lambda$ is a Lagrange multiplier. We differentiate $\psi$ with respect to the 
components of $x$, and the derivatives set equal to 0 are 

(27)    $\frac{\partial\psi}{\partial x} = 2x - 2\lambda\Sigma^{-1}x = 0,$ 

or 

(28)    $x = \lambda\Sigma^{-1}x.$ 

Multiplication by $\Sigma$ gives 

(29)    $\Sigma x = \lambda x.$ 

This equation is the same as (4) and the same algebra can be developed. 
Thus the vectors $\beta^{(1)},\ldots,\beta^{(p)}$ give the principal axes of the ellipsoid. The 
transformation $u = B'x$ is a rotation of the coordinate axes so that the new 
axes are in the direction of the principal axes of the ellipsoid. In the new 
coordinates the ellipsoid is 

(30)    $u'\Lambda^{-1}u = \sum_{i=1}^{p}\frac{u_i^2}{\lambda_i} = C.$ 

Thus the length of the $i$th principal axis is $2\sqrt{\lambda_iC}$. 

A third approach to the same results is in terms of planes of closest fit 
[Pearson (1901)]. Consider a plane through the origin, $\alpha'x = 0$, where $\alpha'\alpha = 1$. 
The distance of a point $x$ from this plane is $\alpha'x$. Let us find the 
coefficients of a plane such that the expected distance squared of a random 
point $X$ from the plane is a minimum, where $\mathscr{E}X = 0$ and $\mathscr{E}XX' = \Sigma$. Thus 
we wish to minimize $\mathscr{E}(\alpha'X)^2 = \mathscr{E}\alpha'XX'\alpha = \alpha'\Sigma\alpha$, subject to the 
restriction $\alpha'\alpha = 1$. Comparison with the first approach immediately shows that the 
solution is $\alpha = \beta^{(p)}$. 

Analysis into principal components is most suitable when all the components 
of $X$ are measured in the same units. If they are not measured in the 
same units, the rationale of maximizing $\beta'\Sigma\beta$ relative to $\beta'\beta$ is questionable; 
in fact, the analysis will depend on the various units of measurement. 
Suppose $\Delta$ is a diagonal matrix, and let $Y = \Delta X$. For example, one component 
of $X$ may be measured in inches and the corresponding component of $Y$ 
may be measured in feet; another component of $X$ may be in pounds and the 
corresponding one of $Y$ in ounces. The covariance matrix of $Y$ is $\mathscr{E}YY' = 
\mathscr{E}\Delta XX'\Delta = \Delta\Sigma\Delta = \Psi$, say. Then analysis of $Y$ into principal components 
involves maximizing $\mathscr{E}(\gamma'Y)^2 = \gamma'\Psi\gamma$ relative to $\gamma'\gamma$ and leads to the 
equation $0 = (\Psi - \nu I)\gamma = (\Delta\Sigma\Delta - \nu I)\gamma$, where $\nu$ must satisfy $|\Psi - \nu I| = 0$. 
Multiplication on the left by $\Delta^{-1}$ gives 

(31)    $0 = (\Sigma - \nu\Delta^{-2})(\Delta\gamma).$ 

Let $\Delta\gamma = \alpha$; that is, $\gamma'Y = \gamma'\Delta X = \alpha'X$. Then (31) results from maximizing 
$\mathscr{E}(\alpha'X)^2 = \alpha'\Sigma\alpha$ relative to $\alpha'\Delta^{-2}\alpha$. This last quadratic form is a weighted 
sum of squares, the weights being the diagonal elements of $\Delta^{-2}$. 

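It might be noted that if $\Delta^{-2}$ is taken to be the matrix 

(32)    $\Delta^{-2} = \begin{pmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{pmatrix},$ 

then $\Psi$ is the matrix of correlations. 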
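The dependence on units can be seen in a two-variable sketch. The covariance matrix below is hypothetical (illustrative values only). The code rescales with $\Delta^{-2} = \mathrm{diag}(\sigma_{11}, \sigma_{22})$ as in (32), finds the first characteristic vector $\gamma$ of the resulting correlation matrix, and expresses it in the original units as $\alpha = \Delta\gamma$ for comparison with the first characteristic vector of the covariance matrix itself.

```python
import math

# Hypothetical covariance matrix of (X1, X2); values are illustrative only.
SIGMA = [[4.0, 2.0],
         [2.0, 3.0]]

def rescale(sigma, delta):
    """Covariance matrix of Y = Delta X, with diagonal Delta given by its diagonal."""
    p = len(sigma)
    return [[delta[i] * sigma[i][j] * delta[j] for j in range(p)] for i in range(p)]

def leading_direction(sigma):
    """Angle (radians) of the first characteristic vector of a symmetric 2 x 2 matrix."""
    a, b, d = sigma[0][0], sigma[0][1], sigma[1][1]
    return 0.5 * math.atan2(2.0 * b, a - d)

# Delta^{-2} = diag(sigma_11, sigma_22), as in (32), gives the correlation matrix.
delta = [1.0 / math.sqrt(SIGMA[i][i]) for i in range(2)]
psi = rescale(SIGMA, delta)

angle_cov = leading_direction(SIGMA)
angle_corr = leading_direction(psi)

# Express the correlation-based component in the original units: alpha = Delta gamma.
gamma = (math.cos(angle_corr), math.sin(angle_corr))
alpha = (delta[0] * gamma[0], delta[1] * gamma[1])
angle_alpha = math.atan2(alpha[1], alpha[0])

print("correlation matrix:", psi)
print("direction from the covariance matrix: ", angle_cov)
print("direction from the correlation matrix:", angle_alpha)
```

The two directions differ, which is the point of the paragraph above: the leading component found from the correlation matrix is not the leading component of the covariance matrix.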


11.3. MAXIMUM LIKELIHOOD ESTIMATORS OF THE PRINCIPAL 
COMPONENTS AND THEIR VARIANCES 


A primary problem of statistical inference in principal component analysis is 
to estimate the vectors $\beta^{(1)},\ldots,\beta^{(p)}$ and the scalars $\lambda_1,\ldots,\lambda_p$. We apply the 
algebra of the preceding section to an estimate of the covariance matrix. 


Theorem 11.3.1. Let $x_1,\ldots,x_N$ be $N$ ($> p$) observations from $N(\mu, \Sigma)$, 
where $\Sigma$ is a matrix with $p$ different characteristic roots. Then a set of maximum 
likelihood estimators of $\lambda_1,\ldots,\lambda_p$ and $\beta^{(1)},\ldots,\beta^{(p)}$ defined in Theorem 11.2.1 
consists of the roots $k_1 > \cdots > k_p$ of 

(1)    $|\hat\Sigma - kI| = 0$ 

and a set of corresponding vectors $b^{(1)},\ldots,b^{(p)}$ satisfying 

(2)    $(\hat\Sigma - k_iI)b^{(i)} = 0,$ 

(3)    $b^{(i)\prime}b^{(i)} = 1,$ 

where $\hat\Sigma$ is the maximum likelihood estimate of $\Sigma$. 

Proof. When the roots of $|\Sigma - \lambda I| = 0$ are different, each vector $\beta^{(i)}$ is 
uniquely defined except that $\beta^{(i)}$ can be replaced by $-\beta^{(i)}$. If we require that 
the first nonzero component of $\beta^{(i)}$ be positive, then $\beta^{(i)}$ is uniquely defined, 
and $\mu, \Lambda, B$ is a single-valued function of $\mu, \Sigma$. By Corollary 3.2.1, the set of 
maximum likelihood estimates of $\mu, \Lambda, B$ is the same function of $\hat\mu, \hat\Sigma$. This 
function is defined by (1), (2), and (3) with the corresponding restriction that 
the first nonzero component of $b^{(i)}$ must be positive. [It can be shown that if 
$|\Sigma| \ne 0$, the probability is 1 that the roots of (1) are different, because the 
conditions on $\hat\Sigma$ for the roots to have multiplicities higher than 1 determine a 
region in the space of $\hat\Sigma$ of dimensionality less than $\frac{1}{2}p(p + 1)$; see Okamoto 
(1973).] From (18) of Section 11.2 we see that 

(4)    $\Sigma = B\Lambda B' = \sum_{i=1}^{p}\lambda_i\beta^{(i)}\beta^{(i)\prime},$ 

and by the same algebra 

(5)    $\hat\Sigma = \sum_{i=1}^{p}k_ib^{(i)}b^{(i)\prime}.$ 

Replacing $b^{(i)}$ by $-b^{(i)}$ clearly does not change $\sum k_ib^{(i)}b^{(i)\prime}$. Since the 
likelihood function depends only on $\hat\Sigma$ (see Section 3.2), the maximum of the 
likelihood function is attained by taking any set of solutions of (2) and (3). $\blacksquare$ 


It is possible to assume explicitly arbitrary multiplicities of roots of $\Sigma$. If 
these multiplicities are not all unity, the maximum likelihood estimates are 
not defined as in Theorem 11.3.1. [See Anderson (1963a).] As an example 
suppose that we assume that the equation $|\Sigma - \lambda I| = 0$ has one root of 
multiplicity $p$. Let this root be $\lambda_1$. Then by Corollary 11.2.1, $\Sigma - \lambda_1I$ is of 
rank 0; that is, $\Sigma - \lambda_1I = 0$ or $\Sigma = \lambda_1I$. If $X$ is distributed according to 
$N(\mu, \Sigma) = N(\mu, \lambda_1I)$, the components of $X$ are independently distributed 
with variance $\lambda_1$. Thus the maximum likelihood estimator of $\lambda_1$ is 

(6)    $\hat\lambda_1 = \frac{1}{pN}\sum_{i=1}^{p}\sum_{\alpha=1}^{N}(x_{i\alpha} - \bar{x}_i)^2,$ 

and $\hat\Sigma = \hat\lambda_1I$, and $B$ can be any orthogonal matrix. It might be pointed out 
that in Section 10.7 we considered a test of the hypothesis that $\Sigma = \lambda_1I$ (with 
$\lambda_1$ unspecified), that is, the hypothesis is that $\Sigma$ has one characteristic root of 
multiplicity $p$. 

In most applications of principal component analysis it can be assumed 
that the roots of $\Sigma$ are different. It might also be pointed out that in some 
uses of this method the algebra is applied to the matrix of correlation 
coefficients rather than to the covariance matrix. In general this leads to 
different roots and vectors. 


11.4. COMPUTATION OF THE MAXIMUM LIKELIHOOD 
ESTIMATES OF THE PRINCIPAL COMPONENTS 


There are several ways of computing the characteristic roots and characteristic 
vectors (principal components) of a matrix $\Sigma$ or $\hat\Sigma$. We shall indicate 
some of them. 

One method for small $p$ involves expanding the determinantal equation 

(1)    $0 = |\Sigma - \lambda I|$ 

and solving the resulting $p$th-degree equation in $\lambda$ (e.g., by Newton's method 
or the secant method) for the roots $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. Then $\Sigma - \lambda_iI$ is of 
rank $p - 1$, and a solution of $(\Sigma - \lambda_iI)\beta^{(i)} = 0$ can be obtained by taking $\beta_j^{(i)}$ 
as the cofactor of the element in the first (or any other fixed) column and $j$th 
row of $\Sigma - \lambda_iI$. 

The second method iterates using the equation for a characteristic root 
and the corresponding characteristic vector 

(2)    $\Sigma x = \lambda x,$ 

where we have written the equation for the population. Let $x_{(0)}$ be any vector 
not orthogonal to the first characteristic vector, and define 

(3)    $x_{(i+1)} = \Sigma y_{(i)}, \qquad y_{(i)} = \frac{1}{\sqrt{x_{(i)}'x_{(i)}}}\,x_{(i)}, \qquad i = 0, 1, 2, \ldots.$ 

It can be shown (Problem 11.12) that 

(4)    $\lim_{i\to\infty} y_{(i)} = \pm\beta^{(1)}, \qquad \lim_{i\to\infty} x_{(i)}'x_{(i)} = \lambda_1^2.$ 

The rate of convergence depends on the ratio $\lambda_2/\lambda_1$; the closer this ratio is 
to 1, the slower the convergence. 
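The recursion (3) can be sketched in a few lines. The example below uses, in place of $\Sigma$, the sample covariance matrix $S$ for Iris versicolor given in Section 11.5 and the starting vector $(1, 0, 1, 0)'$ used there. The function names and the stopping rule are assumptions of this sketch, not part of the text.

```python
import math

# Sample covariance matrix S from Section 11.5 (Iris versicolor), used in place of Sigma.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matvec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def power_iteration(a, x0, tol=1e-12, max_iter=1000):
    """Iterate x_(i+1) = A y_(i), y_(i) = x_(i)/|x_(i)|, as in (3).

    Returns (root, unit vector, iterations used). The stopping rule
    (change in |x| below tol) is an assumption of this sketch.
    """
    x = list(x0)
    previous = None
    for iteration in range(1, max_iter + 1):
        length = norm(x)
        y = [v / length for v in x]
        x = matvec(a, y)
        root = norm(x)              # |x_(i+1)| tends to the largest root
        if previous is not None and abs(root - previous) < tol:
            break
        previous = root
    length = norm(x)
    return root, [v / length for v in x], iteration

root1, vector1, n_iter = power_iteration(S, [1.0, 0.0, 1.0, 0.0])
print("largest root:", root1)
print("vector:", vector1)
print("iterations:", n_iter)
```

Because the second root of this matrix is small relative to the first, only a handful of iterations are needed, in line with the remark on the ratio $\lambda_2/\lambda_1$.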
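To find the second root and vector define 

(5)    $\Sigma_2 = \Sigma - \lambda_1\beta^{(1)}\beta^{(1)\prime}.$ 

Then 

(6)    $\Sigma_2\beta^{(i)} = \Sigma\beta^{(i)} - \lambda_1\beta^{(1)}\beta^{(1)\prime}\beta^{(i)} = \Sigma\beta^{(i)} = \lambda_i\beta^{(i)}$ 

if $i \ne 1$, and 

(7)    $\Sigma_2\beta^{(1)} = 0.$ 

Thus $\lambda_2$ is the largest root of $\Sigma_2$ and $\beta^{(2)}$ is the corresponding vector. The 
iteration process is now applied to $\Sigma_2$ to find $\lambda_2$ and $\beta^{(2)}$. Defining 
$\Sigma_3 = \Sigma_2 - \lambda_2\beta^{(2)}\beta^{(2)\prime}$, we can find $\lambda_3$ and $\beta^{(3)}$, and so forth. 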
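The deflation step (5) can be sketched as follows, again with the sample matrix $S$ of Section 11.5 standing in for $\Sigma$. The fixed iteration count and the helper names are assumptions of this sketch.

```python
import math

# Sample covariance matrix S from Section 11.5, used in place of Sigma.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matvec(a, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def largest_root(a, x0, n_iter=500):
    """Power iteration with a fixed number of steps; returns (root, unit vector)."""
    x = list(x0)
    for _ in range(n_iter):
        length = norm(x)
        x = matvec(a, [v / length for v in x])
    length = norm(x)
    return length, [v / length for v in x]

def deflate(a, root, vector):
    """Return A - root * vector vector', the matrix defined in (5)."""
    p = len(a)
    return [[a[i][j] - root * vector[i] * vector[j] for j in range(p)]
            for i in range(p)]

l1, b1 = largest_root(S, [1.0, 0.0, 1.0, 0.0])
S2 = deflate(S, l1, b1)
l2, b2 = largest_root(S2, [0.0, 1.0, 0.0, 0.0])

inner = sum(u * v for u, v in zip(b1, b2))
print("first root: ", l1)
print("second root:", l2)
print("inner product of the two vectors:", inner)
```

The second vector is orthogonal to the first up to rounding error, as (6) and (7) imply. Errors in the first root and vector carry over into the deflated matrix, so the first pair should be computed accurately.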

There are several ways in which the labor of the iteration procedure may 
be reduced. One is to raise $\Sigma$ to a power before proceeding with the 
iteration. Thus one can use $\Sigma^2$, defining 

(8)    $y_{(i)} = \frac{1}{\sqrt{x_{(i)}'x_{(i)}}}\,x_{(i)}, \qquad x_{(i+1)} = \Sigma^2y_{(i)}, \qquad i = 0, 1, 2, \ldots.$ 

This procedure will give twice as rapid convergence as the use of (3). Using 
$\Sigma^4 = \Sigma^2\Sigma^2$ will lead to convergence four times as rapid, and so on. It should 
be noted that since $\Sigma^2$ is symmetric, there are only $p(p + 1)/2$ elements to 
be found. 

Efficient computation, however, uses other methods. One method is the 
QR or QL algorithm. Let $\Sigma_1 = \Sigma$. Define recursively the orthogonal $Q_i$ and 
lower triangular $L_i$ by $\Sigma_i = Q_iL_i$ and $\Sigma_{i+1} = L_iQ_i$ ($= Q_i'\Sigma_iQ_i$), $i = 1, 2, \ldots$. 
(The Gram-Schmidt orthogonalization is a way of finding $Q_i$ and $L_i$; the QR 
method replaces a lower triangular matrix $L$ by an upper triangular matrix 
$R$.) If the characteristic roots of $\Sigma$ are distinct, $\lim_{i\to\infty}\Sigma_{i+1} = \Lambda^*$, where $\Lambda^*$ 
is the diagonal matrix with the roots usually ordered in ascending order. The 
characteristic vectors are the columns of $\lim_{i\to\infty} Q_1Q_2\cdots Q_i$ (which is 
computed recursively). 
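A minimal sketch of the iteration follows. It uses the QR variant (upper triangular factor, obtained by modified Gram-Schmidt) rather than QL, without shifts, and it is applied to the sample matrix $S$ of Section 11.5. These choices, the fixed iteration count, and the function names are assumptions of the sketch; production routines use shifts and a preliminary reduction to tridiagonal form.

```python
import math

# Sample covariance matrix S from Section 11.5, used in place of Sigma.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def qr_gram_schmidt(a):
    """Factor A = Q R by modified Gram-Schmidt on the columns of A."""
    p = len(a)
    cols = [[a[i][j] for i in range(p)] for j in range(p)]
    q_cols = []
    r = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = list(cols[j])
        for k, q in enumerate(q_cols):
            r[k][j] = sum(q_i * v_i for q_i, v_i in zip(q, v))
            v = [v_i - r[k][j] * q_i for v_i, q_i in zip(v, q)]
        r[j][j] = math.sqrt(sum(v_i * v_i for v_i in v))
        q_cols.append([v_i / r[j][j] for v_i in v])
    q = [[q_cols[j][i] for j in range(p)] for i in range(p)]
    return q, r

def qr_algorithm(a, n_iter=300):
    """Unshifted QR iteration; returns (diagonal of the limit, accumulated Q)."""
    p = len(a)
    current = [row[:] for row in a]
    vectors = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(n_iter):
        q, r = qr_gram_schmidt(current)
        current = matmul(r, q)          # next iterate R Q = Q' (current) Q
        vectors = matmul(vectors, q)    # accumulate Q_1 Q_2 ... Q_i
    return [current[i][i] for i in range(p)], vectors

roots, vectors = qr_algorithm(S)
print("roots:", roots)
print("first characteristic vector:", [row[0] for row in vectors])
```

The columns of the accumulated product approximate the characteristic vectors, and the diagonal of the final iterate approximates the roots.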

A more efficient algorithm (for the symmetric $\Sigma$) uses a sequence of 
Householder transformations to carry $\Sigma$ to tridiagonal form. A Householder 
matrix is $H = I - 2\alpha\alpha'$, where $\alpha'\alpha = 1$. Such a matrix is orthogonal and 
symmetric. A Householder transformation of the symmetric matrix $\Sigma$ is 
$H\Sigma H$. It is symmetric and has the same characteristic roots as $\Sigma$; its 
characteristic vectors are $H$ times those of $\Sigma$. 

A tridiagonal matrix is one with all entries 0 except on the main diagonal, 
the first superdiagonal, and the first subdiagonal. A sequence of $p - 2$ 
Householder transformations carries the symmetric $\Sigma$ to tridiagonal form. 
(The first one inserts 0's into the last $p - 2$ entries of the first column and 
row of $H\Sigma H$, etc. See Problem 11.13.) 
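The first of the $p - 2$ Householder transformations can be sketched as follows, using the sample matrix $S$ of Section 11.5 in place of $\Sigma$. The particular choice of $\alpha$ (including the sign convention that avoids cancellation) is a standard construction assumed here; the text does not spell it out.

```python
import math

# Sample covariance matrix S from Section 11.5, used in place of Sigma.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def householder_step(a):
    """Return (H, H A H) where H = I - 2 alpha alpha' zeroes the last p - 2
    entries of the first column and row of the symmetric matrix A."""
    p = len(a)
    x = [a[i][0] for i in range(1, p)]           # first column below the diagonal
    length = math.sqrt(sum(v * v for v in x))
    target = -length if x[0] > 0 else length     # sign chosen to avoid cancellation
    v = list(x)
    v[0] -= target
    v_norm = math.sqrt(sum(t * t for t in v))
    alpha = [0.0] + [t / v_norm for t in v]      # unit vector, alpha' alpha = 1
    h = [[(1.0 if i == j else 0.0) - 2.0 * alpha[i] * alpha[j]
          for j in range(p)] for i in range(p)]
    return h, matmul(matmul(h, a), h)

H, T = householder_step(S)
for row in T:
    print([round(value, 6) for value in row])
```

Applying the same construction to the trailing $3 \times 3$ block of the result would complete the reduction to tridiagonal form for $p = 4$.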
The QL method is applied to the tridiagonal form. At the $i$th step let the 
tridiagonal matrix be $T_0^{(i)}$; let $P_j^{(i)}$ be a block-diagonal matrix (Givens matrix) 

(9)    $P_j^{(i)} = \begin{pmatrix} I & 0 & 0 & 0 \\ 0 & \cos\theta_j & -\sin\theta_j & 0 \\ 0 & \sin\theta_j & \cos\theta_j & 0 \\ 0 & 0 & 0 & I \end{pmatrix},$ 

where $\cos\theta_j$ is the $j$th and $(j + 1)$st diagonal element; and let $T_j^{(i)} = P_{p-j}^{(i)}T_{j-1}^{(i)}$, 
$j = 1,\ldots,p - 1$. Here $\theta_j$ is chosen so that the element in position $j, j + 1$ in $T_j^{(i)}$ 
is 0. Then $P^{(i)} = P_1^{(i)}P_2^{(i)}\cdots P_{p-1}^{(i)}$ is orthogonal and $P^{(i)}T_0^{(i)} = R^{(i)}$ is lower 
triangular. Then $T_0^{(i+1)} = R^{(i)}P^{(i)\prime}$ ($= P^{(i)}T_0^{(i)}P^{(i)\prime}$) is symmetric and tridiagonal. 
It converges to $\Lambda^*$ (if the roots are all different). For more details see 
Chapters II/2 and II/3 of Wilkinson and Reinsch (1971), Chapter 5 of 
Wilkinson (1965), and Chapters 5, 7, and 8 of Golub and Van Loan (1989). 
A sequence of one-sided Householder transformations ($H\Sigma$) can carry $\Sigma$ to 
$R$ (upper triangular), thus effecting the QR decomposition. 


11.5. AN EXAMPLE 


In Table 3.4 we presented three samples of observations on varieties of iris 
[Fisher (1936)]; as an example of principal component analysis we use one of 
those samples, namely Iris versicolor. There are 50 observations ($N = 50$, 
$n = N - 1 = 49$). Each observation consists of four measurements on a plant: 
$x_1$ is sepal length, $x_2$ is sepal width, $x_3$ is petal length, and $x_4$ is petal width. 
The observed sums of squares and cross products of deviations from means 
are 

(1)    $A = \sum_{\alpha=1}^{50}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \begin{pmatrix} 13.0552 & 4.1740 & 8.9620 & 2.7332 \\ 4.1740 & 4.8250 & 4.0500 & 2.0190 \\ 8.9620 & 4.0500 & 10.8200 & 3.5820 \\ 2.7332 & 2.0190 & 3.5820 & 1.9162 \end{pmatrix},$ 

and an estimate of $\Sigma$ is 

(2)    $S = \frac{1}{49}A = \begin{pmatrix} 0.266433 & 0.085184 & 0.182899 & 0.055780 \\ 0.085184 & 0.098469 & 0.082653 & 0.041204 \\ 0.182899 & 0.082653 & 0.220816 & 0.073102 \\ 0.055780 & 0.041204 & 0.073102 & 0.039106 \end{pmatrix}.$ 
We use the iterative procedure to find the first principal component, by 
computing in turn $z^{(j)} = Sz^{(j-1)}$. As an initial approximation, we use $z^{(0)} = 
(1, 0, 1, 0)'$. It is not necessary to normalize the vector at each iteration; but to 
compare successive vectors, we compute $z_i^{(j)}/z_i^{(j-1)} = r_i^{(j)}$, each of which is an 
approximation to $l_1$, the largest root of $S$. After seven iterations, the $r_i^{(7)}$ agree to 
within two units in the fifth decimal place (fifth significant figure). This 
vector is normalized, and $S$ is applied to the normalized vector. The ratios, 
$r_i^{(8)}$, agree to within two units in the sixth place; the value of $l_1$ is (nearly 
accurate to the sixth place) $l_1 = 0.487875$. The normalized eighth iterated 
vector is our estimate of $\beta^{(1)}$, namely, 

(3)    $b^{(1)} = \begin{pmatrix} 0.6867244 \\ 0.3053463 \\ 0.6236628 \\ 0.2149837 \end{pmatrix}.$ 

This vector agrees with the normalized seventh iterate to about one unit in 
the sixth place. It should be pointed out that $l_1$ and $b^{(1)}$ have to be calculated 
more accurately than $l_2$ and $b^{(2)}$, and so forth. The trace of $S$ is 0.624824, 
which is the sum of the roots. Thus $l_1$ is more than three times the sum of 
the other roots. 

We next compute 

(4)    $S_2 = S - l_1b^{(1)}b^{(1)\prime} = \begin{pmatrix} 0.0363559 & -0.0171179 & -0.0260502 & -0.0162472 \\ -0.0171179 & 0.0529813 & -0.0102546 & 0.0091777 \\ -0.0260502 & -0.0102546 & 0.0310544 & 0.0076890 \\ -0.0162472 & 0.0091777 & 0.0076890 & 0.0165574 \end{pmatrix},$ 

and iterate $z^{(j)} = S_2z^{(j-1)}$, using $z^{(0)} = (0, 1, 0, 0)'$. (In the actual computation 
$S_2$ was multiplied by 10 and the first row and column were multiplied by $-1$.) 
In this case the iteration does not proceed as rapidly; as will be seen, the 
ratio of $l_2$ to $l_3$ is approximately 1.32. On the last iteration, the ratios agree 
to within four units in the fifth significant figure. We obtain $l_2 = 0.0723828$ 
and 

(5)    $b^{(2)} = \begin{pmatrix} -0.669033 \\ 0.567484 \\ 0.343309 \\ 0.335307 \end{pmatrix}.$ 

The third principal component is found from $S_3 = S_2 - l_2b^{(2)}b^{(2)\prime}$, and the 
fourth from $S_4 = S_3 - l_3b^{(3)}b^{(3)\prime}$. 
The results may be summarized as follows: 

(6)    $(l_1, l_2, l_3, l_4) = (0.4879, 0.0724, 0.0548, 0.0098),$ 

(7)    $B = \begin{pmatrix} 0.6867 & -0.6690 & -0.2651 & 0.1023 \\ 0.3053 & 0.5675 & -0.7296 & -0.2289 \\ 0.6237 & 0.3433 & 0.6272 & -0.3160 \\ 0.2150 & 0.3353 & 0.0637 & 0.9150 \end{pmatrix}.$ 

The sum of the four roots is $\sum_{i=1}^{4} l_i = 0.6249$, compared with the trace of the 
sample covariance matrix, $\mathrm{tr}\,S = 0.624824$. The first accounts for 78% of the 
total variance in the four measurements; the last accounts for a little more 
than 1%. In fact, the variance of $0.7x_1 + 0.3x_2 + 0.6x_3 + 0.2x_4$ (an approximation 
to the first principal component) is 0.478, which is almost 77% of the 
total variance. If one is interested in studying the variations in conditions that 
lead to variations of $(x_1, x_2, x_3, x_4)$, one can look for variations in conditions 
that lead to variations of $0.7x_1 + 0.3x_2 + 0.6x_3 + 0.2x_4$. It is not very important 
if the other variations in $(x_1, x_2, x_3, x_4)$ are neglected in exploratory 
investigations. 
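The percentages quoted above can be reproduced directly from the matrix $S$ in (2) and the roots in (6). The short sketch below does only this arithmetic; the variable names are illustrative.

```python
# Sample covariance matrix S from (2) and the roots from (6).
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]
ROOTS = [0.4879, 0.0724, 0.0548, 0.0098]
COEFFS = [0.7, 0.3, 0.6, 0.2]        # rounded coefficients of the first component

def quadratic_form(a, s):
    """Return a' S a, the sample variance of the linear combination a'x."""
    p = len(a)
    return sum(a[i] * s[i][j] * a[j] for i in range(p) for j in range(p))

total = sum(S[i][i] for i in range(4))               # tr S
shares = [root / total for root in ROOTS]            # share of each root in tr S
approx_variance = quadratic_form(COEFFS, S)

print("tr S =", round(total, 6))
print("shares of total variance:", [round(s, 4) for s in shares])
print("variance of the rounded combination:", round(approx_variance, 3))
print("as a share of tr S:", round(approx_variance / total, 3))
```

Note that the rounded coefficient vector is not exactly normalized (its squared length is 0.98), which is one reason its variance falls slightly below $l_1$.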


11.6. STATISTICAL INFERENCE 


11.6.1. Asymptotic Distributions 


In Section 13.3 we shall derive the exact distribution of the sample characteristic 
roots and vectors when the population covariance matrix is $I$ or 
proportional to $I$, that is, in the case of all population roots equal. The exact 
distribution of roots and vectors when the population roots are not all equal 
involves a multiply infinite series of zonal polynomials; that development is 
beyond the scope of this book. [See Muirhead (1982).] We derive the 
asymptotic distribution of the roots and vectors when the population roots 
are all different (Theorem 13.5.1) and also when one root is multiple 
(Theorem 13.5.2). Since it can usually be assumed that the population roots 
are different unless there is information to the contrary, we summarize here 
Theorem 13.5.1. 

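As earlier, let the characteristic roots of $\Sigma$ be $\lambda_1 > \cdots > \lambda_p$ and the 
corresponding characteristic vectors be $\beta^{(1)},\ldots,\beta^{(p)}$, normalized so $\beta^{(i)\prime}\beta^{(i)} = 1$ 
and satisfying $\beta_{1i} \ge 0$, $i = 1,\ldots,p$. Let the roots and vectors of $S$ be 
$l_1 > \cdots > l_p$ and $b^{(1)},\ldots,b^{(p)}$, normalized so $b^{(i)\prime}b^{(i)} = 1$ and satisfying $b_{1i} \ge 0$, 
$i = 1,\ldots,p$. Let $d_i = \sqrt{n}(l_i - \lambda_i)$ and $g^{(i)} = \sqrt{n}(b^{(i)} - \beta^{(i)})$, $i = 1,\ldots,p$. Then 
in the limiting normal distribution the sets $d_1,\ldots,d_p$ and $g^{(1)},\ldots,g^{(p)}$ are 
independent and $d_1,\ldots,d_p$ are mutually independent. The element $d_i$ has 
the limiting distribution $N(0, 2\lambda_i^2)$. The covariances of $g^{(1)},\ldots,g^{(p)}$ in the 
limiting distribution are 

(1)    $\mathscr{AC}(g^{(i)}) = \sum_{k=1,\,k\ne i}^{p}\frac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\,\beta^{(k)}\beta^{(k)\prime},$ 

(2)    $\mathscr{AC}(g^{(i)}, g^{(j)}) = -\frac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\,\beta^{(j)}\beta^{(i)\prime}, \qquad i \ne j.$ 

See Theorem 13.5.1. 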

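In making inferences about a single ordered root, one treats $l_i$ as approximately 
normal with mean $\lambda_i$ and variance $2\lambda_i^2/n$. Since $l_i$ is a consistent 
estimate of $\lambda_i$, the limiting distribution of 

(3)    $\sqrt{n}\,\frac{l_i - \lambda_i}{\sqrt{2}\,\lambda_i}$ 

is $N(0, 1)$, and it remains so if $\lambda_i$ in the denominator is replaced by $l_i$. A 
two-tailed test of the hypothesis $\lambda_i = \lambda_i^0$ has the (asymptotic) acceptance region 

(4)    $-z(\varepsilon) \le \sqrt{\frac{n}{2}}\,\frac{l_i - \lambda_i^0}{\lambda_i^0} \le z(\varepsilon),$ 

where the value of the $N(0, 1)$ distribution beyond $z(\varepsilon)$ is $\frac{1}{2}\varepsilon$. The interval 
(4) can be inverted to give a confidence interval for $\lambda_i$ with confidence $1 - \varepsilon$: 

(5)    $\frac{l_i}{1 + \sqrt{2/n}\,z(\varepsilon)} \le \lambda_i \le \frac{l_i}{1 - \sqrt{2/n}\,z(\varepsilon)}.$ 

Note that (5) requires $\sqrt{2/n}\,z(\varepsilon) < 1$; that is, $n$ must be large enough 
relative to the chosen confidence coefficient. Alternatively, one can use the fact 
that the limiting distribution of $\sqrt{n}(\log l_i - \log\lambda_i)$ is $N(0, 2)$ by Theorem 4.2.3. 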
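The interval (5) and its logarithmic alternative can be sketched as follows. The example applies them to the largest sample root of Section 11.5 ($l_1 = 0.487875$, $n = 49$) with $\varepsilon = 0.05$; the choice of $\varepsilon$ and the function names are assumptions of this sketch. The normal quantile comes from Python's standard `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def root_confidence_interval(l_i, n, eps):
    """Large-sample interval (5) for a characteristic root.

    l_i: sample root; n: degrees of freedom of S; eps: one minus the confidence.
    z(eps) leaves probability eps/2 in the upper tail, as defined after (4).
    """
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)
    half_width = math.sqrt(2.0 / n) * z
    if half_width >= 1.0:
        raise ValueError("sqrt(2/n) z(eps) must be less than 1 for (5) to apply")
    return l_i / (1.0 + half_width), l_i / (1.0 - half_width)

def root_confidence_interval_log(l_i, n, eps):
    """Interval based on sqrt(n)(log l_i - log lambda_i) being N(0, 2) in the limit."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)
    half_width = math.sqrt(2.0 / n) * z
    return l_i * math.exp(-half_width), l_i * math.exp(half_width)

# Largest root of the Iris versicolor example in Section 11.5 (n = 49).
lower, upper = root_confidence_interval(0.487875, 49, 0.05)
log_lower, log_upper = root_confidence_interval_log(0.487875, 49, 0.05)

print("interval (5):      ", lower, upper)
print("logarithmic version:", log_lower, log_upper)
```

The logarithmic interval is defined for every $n$ and $\varepsilon$, whereas (5) breaks down when $\sqrt{2/n}\,z(\varepsilon) \ge 1$; the sketch raises an error in that case.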

Inference about components of a vector $\beta^{(i)}$ can be based on treating $b^{(i)}$ 
as being approximately normal with mean $\beta^{(i)}$ and (singular) covariance 
matrix $1/n$ times (1). 


11.6.2. Confidence Region for a Characteristic Vector 


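We use the asymptotic distribution of the sample characteristic vectors to 
obtain a large-sample confidence region for the $i$th characteristic vector of $\Sigma$ 
[Anderson (1963a)]. The covariance matrix (1) can be written 

(6)    $B\Delta_i^2B' = B_i^*\Delta_i^{*2}B_i^{*\prime},$ 

where $\Delta_i$ is the $p \times p$ diagonal matrix with 0 as the $i$th diagonal element and 
$\sqrt{\lambda_i\lambda_j}/(\lambda_i - \lambda_j)$ as the $j$th diagonal element, $j \ne i$; $\Delta_i^*$ is the $(p - 1) \times (p - 1)$ 
diagonal matrix obtained from $\Delta_i$ by deleting the $i$th row and column; and 
$B_i^*$ is the $p \times (p - 1)$ matrix formed by deleting the $i$th column from $B$. Then 
$h^{(i)} = \Delta_i^{*-1}B_i^{*\prime}\sqrt{n}(b^{(i)} - \beta^{(i)})$ has a limiting normal distribution with mean 0 
and covariance matrix 

(7)    $\mathscr{AC}(h^{(i)}) = \Delta_i^{*-1}B_i^{*\prime}(B_i^*\Delta_i^{*2}B_i^{*\prime})B_i^*\Delta_i^{*-1} = I_{p-1},$ 

and 

(8)    $h^{(i)\prime}h^{(i)} = n(b^{(i)} - \beta^{(i)})'B_i^*\Delta_i^{*-2}B_i^{*\prime}(b^{(i)} - \beta^{(i)})$ 

has a limiting $\chi^2$-distribution with $p - 1$ degrees of freedom. The matrix of 
the quadratic form in $\sqrt{n}(b^{(i)} - \beta^{(i)})$ is 

(9)    $B_i^*\Delta_i^{*-2}B_i^{*\prime} = \sum_{j\ne i}\beta^{(j)}\frac{(\lambda_i - \lambda_j)^2}{\lambda_i\lambda_j}\beta^{(j)\prime}$ 
       $= \sum_{j\ne i}\beta^{(j)}\left(\frac{\lambda_i}{\lambda_j} - 2 + \frac{\lambda_j}{\lambda_i}\right)\beta^{(j)\prime}$ 
       $= \lambda_i\sum_{j=1}^{p}\frac{1}{\lambda_j}\beta^{(j)}\beta^{(j)\prime} - 2\sum_{j=1}^{p}\beta^{(j)}\beta^{(j)\prime} + \frac{1}{\lambda_i}\sum_{j=1}^{p}\lambda_j\beta^{(j)}\beta^{(j)\prime}$ 
       $= \lambda_i\Sigma^{-1} - 2I + (1/\lambda_i)\Sigma$ 

because $B\Lambda^{-1}B' = \Sigma^{-1}$, $BB' = I$, and $B\Lambda B' = \Sigma$. Then (8) is 

(10)    $n(b^{(i)} - \beta^{(i)})'[\lambda_i\Sigma^{-1} - 2I + (1/\lambda_i)\Sigma](b^{(i)} - \beta^{(i)})$ 
        $= nb^{(i)\prime}[\lambda_i\Sigma^{-1} - 2I + (1/\lambda_i)\Sigma]b^{(i)}$ 
        $= n[\lambda_ib^{(i)\prime}\Sigma^{-1}b^{(i)} + (1/\lambda_i)b^{(i)\prime}\Sigma b^{(i)} - 2],$ 

because $\beta^{(i)}$ is a characteristic vector of $\Sigma$ with root $\lambda_i$, and of $\Sigma^{-1}$ with 
root $1/\lambda_i$. On the left-hand side of (10) we can replace $\Sigma$ and $\lambda_i$ by the 
consistent estimators $S$ and $l_i$ to obtain 

(11)    $n(b^{(i)} - \beta^{(i)})'[l_iS^{-1} - 2I + (1/l_i)S](b^{(i)} - \beta^{(i)})$ 
        $= n[l_i\beta^{(i)\prime}S^{-1}\beta^{(i)} + (1/l_i)\beta^{(i)\prime}S\beta^{(i)} - 2],$ 

which has a limiting $\chi^2$-distribution with $p - 1$ degrees of freedom. 

A confidence region for the $i$th characteristic vector of $\Sigma$ with confidence 
$1 - \varepsilon$ consists of the intersection of $\beta^{(i)\prime}\beta^{(i)} = 1$ and the set of $\beta^{(i)}$ such that 
the right-hand side of (11) is less than $\chi^2_{p-1}(\varepsilon)$, where $\Pr\{\chi^2_{p-1} > \chi^2_{p-1}(\varepsilon)\} = \varepsilon$. 
Note that the matrix of the quadratic form (9) is positive semidefinite. 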
This approach also provides a test of the null hypothesis that the $i$th 
characteristic vector is a specified $\beta_0^{(i)}$ ($\beta_0^{(i)\prime}\beta_0^{(i)} = 1$). The hypothesis is 
rejected if the right-hand side of (11) with $\beta^{(i)}$ replaced by $\beta_0^{(i)}$ exceeds 
$\chi^2_{p-1}(\varepsilon)$. 

Mallows (1961) suggested a test of whether some characteristic vector of 
$\Sigma$ is $\beta_0$. Let $B_0$ be a $p \times (p - 1)$ matrix such that $B_0'\beta_0 = 0$. If the null 
hypothesis is true, $\beta_0'X$ and $B_0'X$ are independent (because $B_0$ is a nonsingular 
transform of the set of other characteristic vectors). The test is based on 
the multiple correlation between $\beta_0'X$ and $B_0'X$. In principle, the test 
procedure can be inverted to obtain a confidence region. The usefulness of 
these procedures is limited by the fact that the hypothesized vector is not 
attached to a characteristic root; the interpretation depends on the root (e.g., 
largest versus smallest). 

Tyler (1981), (1983b) has generalized the confidence region (11) to include 
the vectors in a linear subspace. He has also studied easing the restrictions of 
a normally distributed parent population. 


11.6.3. Exact Confidence Limits on the Characteristic Roots 


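We now consider a confidence interval for the entire set of characteristic 
roots of $\Sigma$, namely, $\lambda_1 \ge \cdots \ge \lambda_p$ [Anderson (1965a)]. We use the facts that 
$\beta^{(i)\prime}\Sigma\beta^{(i)} = \lambda_i$, $\beta^{(i)\prime}\beta^{(i)} = 1$, $i = 1, p$, and $\beta^{(1)\prime}\Sigma\beta^{(p)} = 0 = \beta^{(1)\prime}\beta^{(p)}$. Then 
$\beta^{(1)\prime}X$ and $\beta^{(p)\prime}X$ are uncorrelated and have variances $\lambda_1$ and $\lambda_p$, respectively. 
Hence $n\beta^{(1)\prime}S\beta^{(1)}/\lambda_1$ and $n\beta^{(p)\prime}S\beta^{(p)}/\lambda_p$ are independently distributed 
as $\chi^2$ with $n$ degrees of freedom. Let $l$ and $u$ be two numbers such 
that 

(12)    $1 - \varepsilon = \Pr\{nl \le \chi^2_n\}\Pr\{\chi^2_n \le nu\}.$ 

Then 

(13)    $1 - \varepsilon = \Pr\left\{nl \le \frac{n\beta^{(1)\prime}S\beta^{(1)}}{\lambda_1},\ \frac{n\beta^{(p)\prime}S\beta^{(p)}}{\lambda_p} \le nu\right\}$ 
        $= \Pr\left\{\frac{\beta^{(p)\prime}S\beta^{(p)}}{u} \le \lambda_p,\ \lambda_1 \le \frac{\beta^{(1)\prime}S\beta^{(1)}}{l}\right\}$ 
        $\le \Pr\left\{\min_{b'b=1}\frac{b'Sb}{u} \le \lambda_p,\ \lambda_1 \le \max_{b'b=1}\frac{b'Sb}{l}\right\}$ 
        $= \Pr\left\{\frac{l_p}{u} \le \lambda_p,\ \lambda_1 \le \frac{l_1}{l}\right\}.$ 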
Theorem 11.6.1. A confidence interval for the characteristic roots of $\Sigma$ 
with confidence at least $1 - \varepsilon$ is 

(14)    $\frac{l_p}{u} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l},$ 

where $l$ and $u$ satisfy (12). 

A tighter inequality can lead to a better lower bound. The matrix $H = 
nB'SB$ has characteristic roots $nl_1,\ldots,nl_p$ because $B$ is orthogonal. We use 
the following lemma. 


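Lemma 11.6.1. For any positive definite matrix $H$ 

(15)    $\mathrm{ch}_p(H) \le \frac{1}{h^{ii}} \le h_{ii} \le \mathrm{ch}_1(H), \qquad i = 1,\ldots,p,$ 

where $H^{-1} = (h^{ij})$ and $\mathrm{ch}_p(H)$ and $\mathrm{ch}_1(H)$ are the minimum and maximum 
characteristic roots of $H$, respectively. 


Proof. From Theorem A.2.4 in the Appendix we have $\mathrm{ch}_p(H) \le h_{ii} \le 
\mathrm{ch}_1(H)$ and 

(16)    $\mathrm{ch}_p(H^{-1}) \le h^{ii} \le \mathrm{ch}_1(H^{-1}), \qquad i = 1,\ldots,p.$ 

Since $\mathrm{ch}_p(H) = 1/\mathrm{ch}_1(H^{-1})$ and $\mathrm{ch}_1(H) = 1/\mathrm{ch}_p(H^{-1})$, the lemma follows. 
$\blacksquare$ 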


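The argument for Theorem 5.2.2 shows that $1/(\lambda_ph^{pp})$ is distributed as 
$\chi^2$ with $n - p + 1$ degrees of freedom, and Theorem 4.3.3 shows that $h^{pp}$ is 
independent of $h_{11}$. Let $l'$ and $u'$ be two numbers such that 

(17)    $1 - \varepsilon = \Pr\{nl' \le \chi^2_n\}\Pr\{\chi^2_{n-p+1} \le nu'\}.$ 

Then 

(18)    $1 - \varepsilon = \Pr\left\{nl' \le \frac{h_{11}}{\lambda_1},\ \frac{1}{\lambda_ph^{pp}} \le nu'\right\}$ 
        $\le \Pr\left\{\frac{l_p}{u'} \le \lambda_p,\ \lambda_1 \le \frac{l_1}{l'}\right\},$ 

since $\mathrm{ch}_p(H) = nl_p$ and $\mathrm{ch}_1(H) = nl_1$. 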
Theorem 11.6.2. A confidence interval for the characteristic roots of $\Sigma$ 
with confidence at least $1 - \varepsilon$ is 

(19)    $\frac{l_p}{u'} \le \lambda_p \le \lambda_1 \le \frac{l_1}{l'},$ 

where $l'$ and $u'$ satisfy (17). 


Anderson (1965a, 1965b) showed that the above confidence bounds are 
optimal within the class of bounds 

(20)    $f(l_1,\ldots,l_p) \le \lambda_p \le \lambda_1 \le g(l_1,\ldots,l_p),$ 

where $f$ and $g$ are homogeneous of degree 1 and are monotonically nondecreasing 
in each argument for fixed values of the others. If (20) holds with 
probability at least $1 - \varepsilon$, then a pair of numbers $u'$ and $l'$ can be found to 
satisfy (17) and 

(21)    $f(l_1,\ldots,l_p) \le \frac{l_p}{u'}, \qquad \frac{l_1}{l'} \le g(l_1,\ldots,l_p).$ 

The homogeneity condition means that the confidence bounds are multiplied 
by $c^2$ if the observed vectors are multiplied by $c$ (which is a kind of scale 
invariance). The monotonicity conditions imply that an increase in the size of 
$S$ results in an increase in the limits for $\Sigma$ (which is a kind of consistency). 

The confidence bounds given in (31) of Section 10.8 for the roots of $\Sigma$ 
based on the distribution of the roots of $S$ when $\Sigma = I$ are greater. 


11.7. TESTING HYPOTHESES ABOUT THE CHARACTERISTIC 
ROOTS OF A COVARIANCE MATRIX 


11.7.1. Testing a Hypothesis about the Sum of the Smallest 
Characteristic Roots 


An investigator may raise the question whether the last $p - m$ principal
components may be ignored, that is, whether the first $m$ principal components
furnish a good approximation to $X$. He may want to do this if the sum
of the variances of the last principal components is less than some specified
amount, say $\gamma$. Consider the null hypothesis

(1)  $H: \lambda_{m+1} + \cdots + \lambda_p \ge \gamma,$

where $\gamma$ is specified, against the alternative that the sum is less than $\gamma$. If the
characteristic roots of $\Sigma$ are different, it follows from Theorem 13.5.1 that

(2)  $\sqrt{n}\left(\sum_{i=m+1}^{p} l_i - \sum_{i=m+1}^{p} \lambda_i\right)$

has a limiting normal distribution with mean 0 and variance $2\sum_{i=m+1}^{p} \lambda_i^2$. The
variance can be consistently estimated by $2\sum_{i=m+1}^{p} l_i^2$. Then a rejection region
with (large-sample) significance level $\varepsilon$ is

(3)  $\sum_{i=m+1}^{p} l_i < \gamma - \frac{\sqrt{2\sum_{i=m+1}^{p} l_i^2}}{\sqrt{n}}\, z(2\varepsilon),$

where $z(2\varepsilon)$ is the upper significance point of the standard normal
distribution for significance level $\varepsilon$. The (large-sample) probability of rejection
is $\varepsilon$ if equality holds in (1) and is less than $\varepsilon$ if inequality holds.

The investigator may alternatively want an upper confidence interval for
$\sum_{i=m+1}^{p} \lambda_i$ with at least approximate confidence level $1 - \varepsilon$. It is

(4)  $\sum_{i=m+1}^{p} \lambda_i \le \sum_{i=m+1}^{p} l_i + \frac{\sqrt{2\sum_{i=m+1}^{p} l_i^2}}{\sqrt{n}}\, z(2\varepsilon).$

If the right-hand side is sufficiently small (in particular less than $\gamma$), the
investigator has confidence that the sum of the variances of the smallest
$p - m$ principal components is so small they can be neglected. Anderson
(1963a) gave this analysis also in the case that $\lambda_{m+1} = \cdots = \lambda_p$.
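The rejection region (3) and the bound (4) can be sketched in a few lines of Python. This is only an illustration and not part of the original text: the function name `sum_smallest_roots_test` and its arguments are hypothetical, the sample roots are assumed to be supplied directly, and $z(2\varepsilon)$ is taken to be the one-sided upper $\varepsilon$ point of the standard normal distribution, as described above.

```python
import math
from statistics import NormalDist


def sum_smallest_roots_test(roots, m, gamma, n, eps=0.05):
    """Large-sample test of H: lambda_{m+1} + ... + lambda_p >= gamma.

    roots : sample characteristic roots l_1, ..., l_p of S (any order)
    m     : number of leading components that are retained
    gamma : specified bound in the null hypothesis (1)
    n     : degrees of freedom of S
    Returns (sum of last p - m roots, upper bound from (4), reject flag from (3)).
    """
    l = sorted(roots, reverse=True)          # l_1 >= ... >= l_p
    tail = l[m:]                             # l_{m+1}, ..., l_p
    total = sum(tail)
    # Estimated standard error: sqrt(2 * sum of squared tail roots) / sqrt(n).
    se = math.sqrt(2.0 * sum(x * x for x in tail)) / math.sqrt(n)
    z = NormalDist().inv_cdf(1.0 - eps)      # one-sided point, z(2*eps) in the text
    upper = total + z * se                   # right-hand side of (4)
    reject = total < gamma - z * se          # rejection region (3)
    return total, upper, reject
```

With roots (5, 3, 1, 0.5, 0.25), m = 2, gamma = 3, and n = 100, the sum of the last three roots is 1.75, and the hypothesis is rejected at the 5% level because the upper bound from (4) falls below gamma.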


11.7.2. Testing a Hypothesis about the Sum of the Smallest 
Characteristic Roots Relative to the Sum of All the Roots 


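The investigator may want to ignore the last $p - m$ principal components if
their sum is small relative to the sum of all the roots (which is the trace of the
covariance matrix). Consider the null hypothesis

(5)  $H: f(\lambda) = \frac{\lambda_{m+1} + \cdots + \lambda_p}{\lambda_1 + \cdots + \lambda_p} \ge \delta,$

where $\delta$ is specified, against the alternative that $f(\lambda) < \delta$. We use the fact
that

(6)  $\frac{\partial f(\lambda)}{\partial \lambda_i} = -\frac{\lambda_{m+1} + \cdots + \lambda_p}{(\lambda_1 + \cdots + \lambda_p)^2}, \quad i = 1, \ldots, m,$
     $\frac{\partial f(\lambda)}{\partial \lambda_i} = \frac{\lambda_1 + \cdots + \lambda_m}{(\lambda_1 + \cdots + \lambda_p)^2}, \quad i = m+1, \ldots, p.$

Then the asymptotic variance of $f(l)$ is

(7)  $2\left(\frac{\delta}{\operatorname{tr}\Sigma}\right)^2 (\lambda_1^2 + \cdots + \lambda_m^2) + 2\left(\frac{1 - \delta}{\operatorname{tr}\Sigma}\right)^2 (\lambda_{m+1}^2 + \cdots + \lambda_p^2)$

when equality holds in (5), by Theorem 4.2.3. The null hypothesis $H$ is
rejected if $\sqrt{n}\,[f(l) - \delta]$ is less than the appropriate significance point of the
standard normal distribution times the square root of (7) with $\lambda$'s replaced by
$l$'s and $\operatorname{tr}\Sigma$ by $\operatorname{tr} S$. Alternatively one can construct a large-sample confidence
region for $f(\lambda)$. A confidence region of approximate confidence $1 - \varepsilon$ is

(8)  $\frac{\sum_{i=m+1}^{p} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \le \frac{\sum_{i=m+1}^{p} l_i}{\sum_{i=1}^{p} l_i} + \frac{z(2\varepsilon)\,\sqrt{2}\left[\left(\sum_{i=m+1}^{p} l_i\right)^2 \sum_{i=1}^{m} l_i^2 + \left(\sum_{i=1}^{m} l_i\right)^2 \sum_{i=m+1}^{p} l_i^2\right]^{1/2}}{\sqrt{n}\left(\sum_{i=1}^{p} l_i\right)^2}.$

If the right-hand side is sufficiently small, the investigator may be willing to
let the first $m$ principal components represent the entire vector of
measurements.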
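The plug-in variance from (7) and the one-sided bound (8) can be computed as follows. This is a minimal sketch, assuming the sample roots are given and that $z(2\varepsilon)$ is the one-sided upper $\varepsilon$ point of the standard normal distribution; the function name `relative_tail_bound` is hypothetical.

```python
import math
from statistics import NormalDist


def relative_tail_bound(roots, m, n, eps=0.05):
    """Estimate f = (sum of last p - m roots) / (sum of all roots) and bound it.

    Returns (f_hat, estimated asymptotic variance from (7), upper bound as in (8)).
    """
    l = sorted(roots, reverse=True)
    head, tail = l[:m], l[m:]
    tr = sum(l)                                   # tr S
    f_hat = sum(tail) / tr
    # (7) with lambda's replaced by l's, tr Sigma by tr S, and delta by f_hat.
    var = (2.0 * (f_hat / tr) ** 2 * sum(x * x for x in head)
           + 2.0 * ((1.0 - f_hat) / tr) ** 2 * sum(x * x for x in tail))
    z = NormalDist().inv_cdf(1.0 - eps)
    upper = f_hat + z * math.sqrt(var / n)
    return f_hat, var, upper
```

In this sketch the null hypothesis (5) would be rejected at level eps when the returned upper bound is below the specified delta, which restates the rejection rule given above.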


11.7.3. Testing Equality of the Smallest Roots


Suppose the observed $X$ is given by $V + U + \mu$, where $V$ and $U$ are
unobservable random vectors with means 0 and $\mu$ is an unobservable vector of
constants. If $\mathcal{E}UU' = \sigma^2 I$, then $U$ can be interpreted as composed of errors
of measurement: uncorrelated components with equal variances. (It is
assumed that all components of $X$ are in the same units.) Then $V$ can be
interpreted as made up of the systematic parts and is supposed to lie in an
$m$-dimensional space. Then $\mathcal{E}VV' = \Phi$ is positive semidefinite of rank $m$.
The observable covariance matrix $\Sigma = \Phi + \sigma^2 I$ has a characteristic root of
$\sigma^2$ with multiplicity $p - m$ (Problem 11.4).

In this subsection we consider testing the null hypothesis that
$\lambda_{m+1} = \cdots = \lambda_p$. That is equivalent to the null hypothesis that
$\Sigma = \Phi + \sigma^2 I$, where $\Phi$ is positive semidefinite of rank $m$. In Section 10.7,
we saw that when $m = 0$, the likelihood ratio criterion was the $\frac{1}{2}pN$th power of
the ratio of the geometric mean to the arithmetic mean of the sample roots.
The analogous criterion here is the $\frac{1}{2}N$th power of

(9)  $\frac{\prod_{i=m+1}^{p} l_i}{\left(\dfrac{\sum_{i=m+1}^{p} l_i}{p - m}\right)^{p-m}}.$

It is also the likelihood ratio criterion, but we shall not derive it. [See
Anderson (1963a).] Let $\sqrt{n}\,(l_i - \lambda_{m+1}) = d_i$, $i = m+1, \ldots, p$. The logarithm
of (9) multiplied by $-n$ is asymptotically equivalent under the null hypothesis
to

(10)  $-n \log \prod_{i=m+1}^{p} l_i + n(p - m) \log \frac{\sum_{i=m+1}^{p} l_i}{p - m}$

      $= -n \sum_{i=m+1}^{p} \log\left(\lambda_{m+1} + n^{-1/2} d_i\right) + n(p - m) \log \frac{\sum_{i=m+1}^{p} \left(\lambda_{m+1} + n^{-1/2} d_i\right)}{p - m}$

      $= n\left\{ -\sum_{i=m+1}^{p} \log\left(1 + \frac{d_i}{\lambda_{m+1} n^{1/2}}\right) + (p - m) \log\left(1 + \frac{\sum_{i=m+1}^{p} d_i}{(p - m)\lambda_{m+1} n^{1/2}}\right) \right\}$

      $= n\left\{ -\sum_{i=m+1}^{p} \left[\frac{d_i}{\lambda_{m+1} n^{1/2}} - \frac{d_i^2}{2\lambda_{m+1}^2 n} + \cdots\right] + (p - m)\left[\frac{\sum_{i=m+1}^{p} d_i}{(p - m)\lambda_{m+1} n^{1/2}} - \frac{\left(\sum_{i=m+1}^{p} d_i\right)^2}{2(p - m)^2 \lambda_{m+1}^2 n} + \cdots\right] \right\}$

      $= \frac{1}{2\lambda_{m+1}^2}\left[\sum_{i=m+1}^{p} d_i^2 - \frac{1}{p - m}\left(\sum_{i=m+1}^{p} d_i\right)^2\right] + o_p(1).$


It is shown in Section 13.5.2 that the limiting distribution of $d_{m+1}, \ldots, d_p$ is
the same as the distribution of the roots of a symmetric matrix $U_{22} = (u_{ij})$,
$i, j = m+1, \ldots, p$, whose functionally independent elements are independent
and normal with mean 0; an off-diagonal element $u_{ij}$, $i < j$, has variance
$\lambda_{m+1}^2$, and a diagonal element $u_{ii}$ has variance $2\lambda_{m+1}^2$. See Theorem 13.5.2.
Then (10) has the limiting distribution of

(11)  $\frac{1}{2\lambda_{m+1}^2}\left[\operatorname{tr} U_{22}^2 - \frac{1}{p - m}\left(\operatorname{tr} U_{22}\right)^2\right]$

      $= \frac{1}{2\lambda_{m+1}^2}\left[2\sum_{i<j} u_{ij}^2 + \sum_{i=m+1}^{p} u_{ii}^2 - \frac{1}{p - m}\left(\sum_{i=m+1}^{p} u_{ii}\right)^2\right].$

Thus $\sum_{i<j} u_{ij}^2/\lambda_{m+1}^2$ is asymptotically $\chi^2$ with $\frac{1}{2}(p - m)(p - m - 1)$ degrees
of freedom; $\frac{1}{2}\left[\sum_{i=m+1}^{p} u_{ii}^2 - \left(\sum_{i=m+1}^{p} u_{ii}\right)^2/(p - m)\right]/\lambda_{m+1}^2$ is asymptotically $\chi^2$
with $p - m - 1$ degrees of freedom. Then (10) has a limiting $\chi^2$-distribution
with $\frac{1}{2}(p - m + 2)(p - m - 1)$ degrees of freedom. The hypothesis is rejected
if the left-hand side of (10) is greater than the upper-tailed significance point
of the $\chi^2$-distribution. If the hypothesis is not rejected, the investigator may
consider the last $p - m$ principal components to be composed entirely of
error.

When the units of measurement are not all the same, the three hypotheses
considered in Section 11.7 have questionable meaning. Corresponding
hypotheses for the correlation matrix also have doubtful interpretation.
Moreover, the last criterion does not have (usually) a $\chi^2$-distribution. More
discussion is given by Anderson (1963a).

The criterion (9) corresponds to the sphericity criterion of Section 10.7,
and the number of degrees of freedom of the corresponding $\chi^2$-distribution
is $\frac{1}{2}(p - m)(p - m + 1) - 1$.
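The statistic on the left-hand side of (10) and its degrees of freedom can be computed directly from the sample roots. The sketch below is illustrative only; the function name `equal_smallest_roots_statistic` is hypothetical, and the comparison with a $\chi^2$ significance point is left to a table or to whatever statistical library is available.

```python
import math


def equal_smallest_roots_statistic(roots, m, n):
    """Statistic for testing lambda_{m+1} = ... = lambda_p.

    Returns (statistic, degrees of freedom), where the statistic is
    -n * log(prod l_i) + n * (p - m) * log(mean l_i) over the last p - m roots
    and the degrees of freedom are (1/2)(p - m + 2)(p - m - 1).
    """
    l = sorted(roots, reverse=True)
    tail = l[m:]
    q = len(tail)                                  # q = p - m
    log_product = sum(math.log(x) for x in tail)   # log of the numerator of (9)
    mean = sum(tail) / q                           # arithmetic mean of the tail roots
    stat = -n * log_product + n * q * math.log(mean)
    df = (q + 2) * (q - 1) // 2                    # always an integer
    return stat, df
```

The statistic is zero when the last $p - m$ sample roots coincide and grows as they spread out, since the arithmetic mean is never smaller than the geometric mean.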


11.8. ELLIPTICALLY CONTOURED DISTRIBUTIONS 


11.8.1. Observations Elliptically Contoured 


Let $x_1, \ldots, x_N$ be $N$ observations on a random vector $X$ with density

(1)  $|\Psi|^{-1/2}\, g[(x - \nu)'\Psi^{-1}(x - \nu)],$

where $\Psi$ is a positive definite matrix, $R^2 = (X - \nu)'\Psi^{-1}(X - \nu)$, and
$\mathcal{E}R^4 < \infty$. Define $\kappa = p\,\mathcal{E}R^4/[(\mathcal{E}R^2)^2 (p + 2)] - 1$. Then $\mathcal{E}X = \nu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Psi = \Sigma$.

The maximum likelihood estimators of the principal components of $\Sigma$ are
the characteristic roots and vectors of $\hat\Sigma = (\mathcal{E}R^2/p)\hat\Psi$ given by (20) of
Section 3.6. Alternative estimators are the characteristic roots and vectors of
$S$, the unbiased estimator of $\Sigma$. The asymptotic normal distributions of these
estimators are derived in Section 13.7 (Theorem 13.7.1). Let $\Sigma = \beta\Lambda\beta'$ and
$S = BLB'$, where $\Lambda$ and $L$ are diagonal and $\beta$ and $B$ are orthogonal. Let
$D = \sqrt{N}(L - \Lambda)$ and $G = \sqrt{N}(B - \beta)$. Then the limiting distribution of $D$
and $G$ is normal with $G$ and $D$ independent.

The variance of $d_i$ is $(2 + 3\kappa)\lambda_i^2$, and the covariance of $d_i$ and $d_j$ ($i \ne j$) is
$\kappa\lambda_i\lambda_j$. The covariance of $g_i$ is

(2)  $\mathcal{E}(g_i g_i') = (1 + \kappa) \sum_{k \ne i} \frac{\lambda_i \lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta_k \beta_k'.$

The covariance of $g_i$ and $g_j$ is

(3)  $\mathcal{E}(g_i g_j') = -(1 + \kappa)\, \frac{\lambda_i \lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta_j \beta_i'.$

For inference about a single ordered root $\lambda_i$, the limiting standard normal
distribution of $\sqrt{N}(l_i - \lambda_i)/(\sqrt{2 + 3\kappa}\; l_i)$ can be used.

For inference about a single vector the right-hand side of (11) in Section
11.6.2 can be used with $S$ replaced by $(1 + \kappa)S$ and $S^{-1}$ by $S^{-1}/(1 + \kappa)$.

It is shown in Section 13.7.1 that the limiting distribution of the logarithm
of the likelihood ratio criterion for testing the equality of the $q = p - m$
smallest roots is the distribution of $(1 + \kappa)\chi^2_{q(q+1)/2 - 1}$.
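The large-sample interval for a single root implied by the standard normal limit above can be sketched as follows. This is an illustration under stated assumptions: the kurtosis parameter $\kappa$ is treated as known (the text above does not say how to estimate it), the function name `root_interval` is hypothetical, and a two-sided normal point is used.

```python
import math
from statistics import NormalDist


def root_interval(l_i, N, kappa=0.0, eps=0.05):
    """Two-sided large-sample interval for lambda_i.

    Uses the limit: sqrt(N) (l_i - lambda_i) / (sqrt(2 + 3 kappa) l_i) is
    approximately standard normal. kappa = 0 gives the normal-theory case.
    """
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)             # two-sided point
    half = z * math.sqrt((2.0 + 3.0 * kappa) / N) * l_i   # half-width of the interval
    return l_i - half, l_i + half
```

The interval widens as $\kappa$ increases, which reflects the larger variance $(2 + 3\kappa)\lambda_i^2$ under heavier-tailed elliptical distributions.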


11.8.2. Elliptically Contoured Matrix Distributions 
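Suppose the density of $X = (x_1, \ldots, x_N)$ is

  $|\Psi|^{-N/2}\, g[\operatorname{tr}\,(X - \nu\varepsilon_N')'\Psi^{-1}(X - \nu\varepsilon_N')]$
  $= |\Psi|^{-N/2}\, g[\operatorname{tr} A\Psi^{-1} + N(\bar{x} - \nu)'\Psi^{-1}(\bar{x} - \nu)],$

where $A = (X - \bar{x}\varepsilon_N')(X - \bar{x}\varepsilon_N')' = nS$ and $n = N - 1$. Thus $\bar{x}$ and $A$ are a
sufficient set of statistics.

Now consider $A = YY'$ having the density $g(\operatorname{tr} A)$. Let $A = BLB'$, where $L$
is diagonal with diagonal elements $l_1 > \cdots > l_p$ and $B$ is orthogonal with
$b_{i1} \ge 0$. Then $L$ and $B$ are independent; the roots $l_1, \ldots, l_p$ have the density
(18) of Section 13.7, and the matrix $B$ has the conditional Haar invariant
distribution.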


PROBLEMS 
11.1. (Sec. 11.2) Prove that the characteristic vectors of $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ are

  $\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$ and $\begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$,

corresponding to roots $1 + \rho$ and $1 - \rho$.

11.2. (Sec. 11.2) Verify that the proof of Theorem 11.2.1 yields a proof of
Theorem A.2.1 of the Appendix for any real symmetric matrix.

11.3. (Sec. 11.2) Let $z = y + x$, where $\mathcal{E}y = \mathcal{E}x = 0$, $\mathcal{E}yy' = \Phi$, $\mathcal{E}xx' = \sigma^2 I$,
$\mathcal{E}yx' = 0$. The $p$ components of $y$ can be called systematic parts, and the
components of $x$ errors.

(a) Find the linear combination $\gamma'z$ of unit variance that has minimum error
variance (i.e., $\gamma'x$ has minimum variance).

(b) Suppose $\phi_{ii} + \sigma^2 = 1$, $i = 1, \ldots, p$. Find the linear function $\gamma'z$ of unit
variance that maximizes the sum of squares of the correlations between $z_i$
and $\gamma'z$, $i = 1, \ldots, p$.

(c) Relate these results to principal components.

11.4. (Sec. 11.2) Let $\Sigma = \Phi + \sigma^2 I$, where $\Phi$ is positive semidefinite of rank $m$.
Prove that each characteristic vector of $\Phi$ is a vector of $\Sigma$, and each root of $\Sigma$
is a root of $\Phi$ plus $\sigma^2$.

11.5. (Sec. 11.2) Let the characteristic roots of $\Sigma$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

(a) What is the form of $\Sigma$ if $\lambda_1 = \lambda_2 = \cdots = \lambda_p > 0$? What is the shape of an
ellipsoid of constant density?

(b) What is the form of $\Sigma$ if $\lambda_1 > \lambda_2 = \cdots = \lambda_p > 0$? What is the shape of an
ellipsoid of constant density?

(c) What is the form of $\Sigma$ if $\lambda_1 = \cdots = \lambda_{p-1} > \lambda_p > 0$? What is the shape of
the ellipsoid of constant density?

11.6. (Sec. 11.2) Intraclass correlation. Let

  $\Sigma = \sigma^2[(1 - \rho)I + \rho\,\varepsilon\varepsilon'],$

where $\varepsilon = (1, \ldots, 1)'$. Show that for $\rho > 0$, the largest characteristic root is
$\sigma^2[1 + (p - 1)\rho]$ and the corresponding characteristic vector is $\varepsilon$. Show that if
$\varepsilon'x = 0$, then $x$ is a characteristic vector corresponding to the root $\sigma^2(1 - \rho)$.
Show that the root $\sigma^2(1 - \rho)$ has multiplicity $p - 1$.

11.7. (Sec. 11.3) In the example of Section 9.6, consider the three pressing
operations $(x_2, x_4, x_5)$. Find the first principal component of this estimated
covariance matrix. [Hint: Start with the vector (1, 1, 1) and iterate.]

11.8. (Sec. 11.3) Prove directly the sample analog of Theorem 11.2.1, where
$\sum x_\alpha = 0$, $\sum x_\alpha x_\alpha' = A$.

11.9. (Sec. 11.3) Let $l_1$ and $l_p$ be the largest and smallest characteristic roots of $S$,
respectively. Prove $\mathcal{E}l_1 \ge \lambda_1$ and $\mathcal{E}l_p \le \lambda_p$.

11.10. (Sec. 11.3) Let $U_1 = \beta^{(1)\prime}X$ be the first population principal component with
variance $\mathcal{V}(U_1) = \lambda_1$, and let $V_1 = b^{(1)\prime}X$ be the first sample principal
component with (sample) variance $l_1$ (based on $S$). Let $S^*$ be the covariance matrix
of a second (independent) sample. Show $\mathcal{E}\,b^{(1)\prime}S^*b^{(1)} \le \lambda_1$.

11.11. (Sec. 11.3) Suppose that $\sigma_{ij} > 0$ for every $i, j$ [$\Sigma = (\sigma_{ij})$]. Show that (a) the
coefficients of the first principal component are all of the same sign, and
(b) the coefficients of each other principal component cannot be all of the
same sign.

11.12. (Sec. 11.4) Prove (4) when $\lambda_1 > \lambda_2$.

(a) Show $\Sigma^i = \beta\Lambda^i\beta'$.

(b) Show

  $x_{(i)} = s_i\,\beta\Lambda^i\beta'x_{(0)} = t_i\,\beta\left(\frac{1}{\lambda_1}\Lambda\right)^i \beta'x_{(0)},$

where $t_i = \lambda_1^i s_i$ and $s_i = 1/\sqrt{x_{(0)}'\Sigma^{2i}x_{(0)}}$.

(c) Show

  $\lim_{i \to \infty} \left(\frac{1}{\lambda_1}\Lambda\right)^i = E_{11},$

where $E_{11}$ has 1 in the upper left-hand position and 0's elsewhere.

(d) Show $\lim_{i \to \infty} t_i^2 = 1/(\beta^{(1)\prime}x_{(0)})^2$.

(e) Conclude the proof.

11.13. (Sec. 11.4) Let

  $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' \\ \sigma_{(1)} & \Sigma_{22} \end{pmatrix}, \qquad K = \begin{pmatrix} 1 & 0 \\ 0 & H \end{pmatrix},$

where $H = I_{p-1} - 2\alpha\alpha'$ and $\alpha$ has $p - 1$ components. Show that $\alpha$ can be
chosen so that in

  $K\Sigma K' = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}'H' \\ H\sigma_{(1)} & H\Sigma_{22}H' \end{pmatrix}$

$H\sigma_{(1)}$ has all 0 components except the first.

11.14. (Sec. 11.6) Show that

  $\log l_i - \sqrt{\frac{2}{n}}\, z(\varepsilon) \le \log \lambda_i \le \log l_i + \sqrt{\frac{2}{n}}\, z(\varepsilon)$

is a confidence interval for $\log \lambda_i$ with approximate confidence $1 - \varepsilon$.

11.15. (Sec. 11.6) Prove that $u' < u$ if $l' = l$ and $p > 2$.

11.16. (Sec. 11.6) Prove that $u < u^*$ if $l = l^*$ and $p > 2$, where $l^*$ and $u^*$ are the $l$
and $u$ of Section 10.8.4.

11.17. The lengths, widths, and heights (in millimeters) of 24 male painted turtles
[Jolicoeur and Mosimann (1960)] are given below. Find the (sample) principal
components and their variances.

Case No.   Length   Width   Height
   1         ...      74      37
   2         ...      78      35
  ...        ...     ...     ...
  13         116      90      43
  14         117     ...      41
  15         117     ...      41
  16         119     ...      41
  17         120     ...      40
  18         120     ...     ...
  19         121     ...     ...
  20         125     ...     ...
  21         127     ...     ...
  22         128     ...     ...
  23         131     ...     ...
  24         135     ...     ...

CHAPTER 12 


Canonical Correlations 
and Canonical Variables 


12.1. INTRODUCTION 


In this section we consider two sets of variates with a joint distribution, and 
we analyze the correlations between the variables of one set and those of the 
other set. We find a new coordinate system in the space of each set of 
variates in such a way that the new coordinates display unambiguously the 
system of correlation. More precisely, we find linear combinations of vari- 
ables in the sets that have maximum correlation; these linear combinations 
are the first coordinates in the new systems. Then a second linear combina- 
tion in each set is sought such that the correlation between these is the 
maximum of correlations between such linear combinations as are uncorre- 
lated with the first linear combinations. The procedure is continued until the 
two new coordinate systems are completely specified. 

The statistical method outlined is of particular usefulness in exploratory
studies. The investigator may have two large sets of variates and may want
to study the interrelations. If the two sets are very large, he may want
to consider only a few linear combinations of each set. Then he will want to
study those linear combinations most highly correlated. For example, one set
of variables may be measurements of physical characteristics, such as various
lengths and breadths of skulls; the other variables may be measurements of
mental characteristics, such as scores on intelligence tests. If the investigator
is interested in relating these, he may find that the interrelation is almost
completely described by the correlation between the first few canonical
variates.

An Introduction to Multivariate Statistical Analysis, Third Edition.
By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.

The basic theory was developed by Hotelling (1935), (1936). 

In Section 12.2 the canonical correlations and variates in the population 
are defined; they imply a linear transformation to canonical form. Maximum 
likelihood estimators are sample analogs. Tests of independence and of the 
rank of a correlation matrix are developed on the basis of asymptotic theory 
in Section 12.4. 

Another formulation of canonical correlations and variates is made in the
case of one set being random and the other set consisting of nonstochastic
variables; the expected values of the random variables are linear
combinations of the nonstochastic variables (Section 12.6). This is the model of
Section 8.2. One set of canonical variables consists of linear combinations of
the random variables and the other set consists of the nonstochastic
variables; the effect of the regression of a member of the first set on a member of
the second is maximized. Linear functional relationships are studied in this
framework.

Simultaneous equations models are studied in Section 12.7. Estimation of 
a single equation in this model is formally identical to estimation of a single 
linear functional relationship. The limited-information maximum likelihood 
estimator and the two-stage least squares estimator are developed. 


12.2. CANONICAL CORRELATIONS AND VARIATES 
IN THE POPULATION 


Suppose the random vector $X$ of $p$ components has the covariance matrix $\Sigma$
(which is assumed to be positive definite). Since we are only interested in
variances and covariances in this chapter, we shall assume $\mathcal{E}X = 0$ when
treating the population. In developing the concepts and algebra we do not
need to assume that $X$ is normally distributed, though this latter assumption
will be made to develop sampling theory.

We partition $X$ into two subvectors of $p_1$ and $p_2$ components,
respectively,

(1)  $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}.$

For convenience we shall assume $p_1 \le p_2$. The covariance matrix is
partitioned similarly into $p_1$ and $p_2$ rows and columns,

(2)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

In the previous chapter we developed a rotation of coordinate axes to a new
system in which the variance properties were clearly exhibited. Here we shall
develop a transformation of the first $p_1$ coordinate axes and a transformation
of the last $p_2$ coordinate axes to a new $(p_1 + p_2)$-system that will exhibit
clearly the intercorrelations between $X^{(1)}$ and $X^{(2)}$.

Consider an arbitrary linear combination, $U = \alpha'X^{(1)}$, of the components
of $X^{(1)}$, and an arbitrary linear function, $V = \gamma'X^{(2)}$, of the components of
$X^{(2)}$. We first ask for the linear functions that have maximum correlation.
Since the correlation of a multiple of $U$ and a multiple of $V$ is the same as
the correlation of $U$ and $V$, we can make an arbitrary normalization of $\alpha$ and
$\gamma$. We therefore require $\alpha$ and $\gamma$ to be such that $U$ and $V$ have unit
variance, that is,

(3)  $1 = \mathcal{E}U^2 = \mathcal{E}\alpha'X^{(1)}X^{(1)\prime}\alpha = \alpha'\Sigma_{11}\alpha,$
(4)  $1 = \mathcal{E}V^2 = \mathcal{E}\gamma'X^{(2)}X^{(2)\prime}\gamma = \gamma'\Sigma_{22}\gamma.$

We note that $\mathcal{E}U = \mathcal{E}\alpha'X^{(1)} = \alpha'\mathcal{E}X^{(1)} = 0$ and similarly $\mathcal{E}V = 0$. Then the
correlation between $U$ and $V$ is

(5)  $\mathcal{E}UV = \mathcal{E}\alpha'X^{(1)}X^{(2)\prime}\gamma = \alpha'\Sigma_{12}\gamma.$

Thus the algebraic problem is to find $\alpha$ and $\gamma$ to maximize (5) subject to (3)
and (4).

Let

(6)  $\psi = \alpha'\Sigma_{12}\gamma - \tfrac{1}{2}\lambda(\alpha'\Sigma_{11}\alpha - 1) - \tfrac{1}{2}\mu(\gamma'\Sigma_{22}\gamma - 1),$

where $\lambda$ and $\mu$ are Lagrange multipliers. We differentiate $\psi$ with respect to
the elements of $\alpha$ and $\gamma$. The vectors of derivatives set equal to zero are

(7)  $\frac{\partial\psi}{\partial\alpha} = \Sigma_{12}\gamma - \lambda\Sigma_{11}\alpha = 0,$
(8)  $\frac{\partial\psi}{\partial\gamma} = \Sigma_{12}'\alpha - \mu\Sigma_{22}\gamma = 0.$

Multiplication of (7) on the left by $\alpha'$ and (8) on the left by $\gamma'$ gives

(9)  $\alpha'\Sigma_{12}\gamma - \lambda\alpha'\Sigma_{11}\alpha = 0,$
(10)  $\gamma'\Sigma_{12}'\alpha - \mu\gamma'\Sigma_{22}\gamma = 0.$

Since $\alpha'\Sigma_{11}\alpha = 1$ and $\gamma'\Sigma_{22}\gamma = 1$, this shows that $\lambda = \mu = \alpha'\Sigma_{12}\gamma$. Thus (7)
and (8) can be written as

(11)  $-\lambda\Sigma_{11}\alpha + \Sigma_{12}\gamma = 0,$
(12)  $\Sigma_{21}\alpha - \lambda\Sigma_{22}\gamma = 0,$

since $\Sigma_{12}' = \Sigma_{21}$. In one matrix equation this is

(13)  $\begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0.$

In order that there be a nontrivial solution [which is necessary for a solution
satisfying (3) and (4)], the matrix on the left must be singular; that is,

(14)  $\begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix} = 0.$

The determinant on the left is a polynomial of degree $p$. To demonstrate
this, consider a Laplace expansion by minors of the first $p_1$ columns. One
term is $|-\lambda\Sigma_{11}| \cdot |-\lambda\Sigma_{22}| = (-\lambda)^{p_1 + p_2}\,|\Sigma_{11}| \cdot |\Sigma_{22}|$. The other terms in the
expansion are of lower degree in $\lambda$ because one or more rows of each minor
in the first $p_1$ columns does not contain $\lambda$. Since $\Sigma$ is positive definite,
$|\Sigma_{11}| \cdot |\Sigma_{22}| \ne 0$ (Corollary A.1.3 of the Appendix). This shows that (14) is a
polynomial equation of degree $p$ and has $p$ roots, say $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. [$\alpha'$
and $\gamma'$ complex conjugate in (9) and (10) prove $\lambda$ real.]
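As a check on (14), the simplest case $p_1 = p_2 = 1$ can be worked by hand. This special case is added here for illustration and is not taken from the text; $\rho$ denotes the ordinary correlation between the two scalar variables, and $\sigma_{11}$, $\sigma_{12} = \sigma_{21}$, $\sigma_{22}$ are the scalar entries of $\Sigma$.

```latex
\begin{aligned}
\begin{vmatrix} -\lambda\sigma_{11} & \sigma_{12} \\ \sigma_{21} & -\lambda\sigma_{22} \end{vmatrix}
  &= \lambda^{2}\sigma_{11}\sigma_{22} - \sigma_{12}^{2} = 0, \\
\lambda &= \pm\frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}} = \pm\rho .
\end{aligned}
```

The polynomial has degree $p = 2$, and its two roots are $\rho$ and $-\rho$, so the larger root is $|\rho|$, the absolute value of the ordinary correlation.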

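From (9) we see that $\lambda = \alpha'\Sigma_{12}\gamma$ is the correlation between $U = \alpha'X^{(1)}$
and $V = \gamma'X^{(2)}$ when $\alpha$ and $\gamma$ satisfy (13) for some value of $\lambda$. Since we
want the maximum correlation, we take $\lambda = \lambda_1$. Let a solution to (13) for
$\lambda = \lambda_1$ be $\alpha^{(1)}$, $\gamma^{(1)}$, and let $U_1 = \alpha^{(1)\prime}X^{(1)}$ and $V_1 = \gamma^{(1)\prime}X^{(2)}$. Then $U_1$ and $V_1$
are normalized linear combinations of $X^{(1)}$ and $X^{(2)}$, respectively, with
maximum correlation.

We now consider finding a second linear combination of $X^{(1)}$, say
$U = \alpha'X^{(1)}$, and a second linear combination of $X^{(2)}$, say $V = \gamma'X^{(2)}$, such that of
all linear combinations uncorrelated with $U_1$ and $V_1$ these have maximum
correlation. This procedure is continued. At the $r$th step we have obtained
linear combinations $U_1 = \alpha^{(1)\prime}X^{(1)}$, $V_1 = \gamma^{(1)\prime}X^{(2)}$, ..., $U_r = \alpha^{(r)\prime}X^{(1)}$,
$V_r = \gamma^{(r)\prime}X^{(2)}$ with corresponding correlations [roots of (14)] $\lambda^{(1)} = \lambda_1, \lambda^{(2)}, \ldots, \lambda^{(r)}$.
We ask for a linear combination of $X^{(1)}$, $U = \alpha'X^{(1)}$, and a linear
combination of $X^{(2)}$, $V = \gamma'X^{(2)}$, that among all linear combinations uncorrelated with
$U_1, V_1, \ldots, U_r, V_r$, have maximum correlation. The condition that $U$ be
uncorrelated with $U_i$ is

(15)  $0 = \mathcal{E}UU_i = \mathcal{E}\alpha'X^{(1)}X^{(1)\prime}\alpha^{(i)} = \alpha'\Sigma_{11}\alpha^{(i)}.$

Then

(16)  $\mathcal{E}UV_i = \alpha'\Sigma_{12}\gamma^{(i)} = \lambda^{(i)}\alpha'\Sigma_{11}\alpha^{(i)} = 0.$

The condition that $V$ be uncorrelated with $V_i$ is

(17)  $0 = \mathcal{E}VV_i = \gamma'\Sigma_{22}\gamma^{(i)}.$

By the same argument we have

(18)  $\mathcal{E}VU_i = \gamma'\Sigma_{21}\alpha^{(i)} = \lambda^{(i)}\gamma'\Sigma_{22}\gamma^{(i)} = 0.$

We now maximize $\mathcal{E}U_{r+1}V_{r+1}$, choosing $\alpha$ and $\gamma$ to satisfy (3), (4), (15),
and (17) for $i = 1, 2, \ldots, r$. Consider

(19)  $\psi_{r+1} = \alpha'\Sigma_{12}\gamma - \tfrac{1}{2}\lambda(\alpha'\Sigma_{11}\alpha - 1) - \tfrac{1}{2}\mu(\gamma'\Sigma_{22}\gamma - 1) + \sum_{i=1}^{r} \nu_i\,\alpha'\Sigma_{11}\alpha^{(i)} + \sum_{i=1}^{r} \theta_i\,\gamma'\Sigma_{22}\gamma^{(i)},$

where $\lambda, \mu, \nu_1, \ldots, \nu_r, \theta_1, \ldots, \theta_r$ are Lagrange multipliers. The vectors of
partial derivatives of $\psi_{r+1}$ with respect to the elements of $\alpha$ and $\gamma$ are set
equal to zero, giving

(20)  $\frac{\partial\psi_{r+1}}{\partial\alpha} = \Sigma_{12}\gamma - \lambda\Sigma_{11}\alpha + \sum_{i=1}^{r} \nu_i\Sigma_{11}\alpha^{(i)} = 0,$
(21)  $\frac{\partial\psi_{r+1}}{\partial\gamma} = \Sigma_{21}\alpha - \mu\Sigma_{22}\gamma + \sum_{i=1}^{r} \theta_i\Sigma_{22}\gamma^{(i)} = 0.$

Multiplication of (20) on the left by $\alpha^{(j)\prime}$ and (21) on the left by $\gamma^{(j)\prime}$ gives

(22)  $0 = \nu_j\,\alpha^{(j)\prime}\Sigma_{11}\alpha^{(j)} = \nu_j,$
(23)  $0 = \theta_j\,\gamma^{(j)\prime}\Sigma_{22}\gamma^{(j)} = \theta_j.$

Thus (20) and (21) are simply (11) and (12) or alternatively (13). We therefore
take the largest $\lambda_i$, say, $\lambda^{(r+1)}$, such that there is a solution to (13) satisfying
(3), (4), (15), and (17) for $i = 1, \ldots, r$. Let this solution be $\alpha^{(r+1)}$, $\gamma^{(r+1)}$, and
let $U_{r+1} = \alpha^{(r+1)\prime}X^{(1)}$ and $V_{r+1} = \gamma^{(r+1)\prime}X^{(2)}$.

This procedure is continued step by step as long as successive solutions
can be found which satisfy the conditions, namely, (13) for some $\lambda_i$, (3), (4),
(15), and (17). Let $m$ be the number of steps for which this can be done. Now
we shall show that $m = p_1$ $(\le p_2)$. Let $A = (\alpha^{(1)} \cdots \alpha^{(m)})$, $\Gamma_1 = (\gamma^{(1)} \cdots \gamma^{(m)})$, and

(24)  $\Lambda = \begin{pmatrix} \lambda^{(1)} & 0 & \cdots & 0 \\ 0 & \lambda^{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda^{(m)} \end{pmatrix}.$

The conditions (3) and (15) can be summarized as

(25)  $A'\Sigma_{11}A = I.$

Since $\Sigma_{11}$ is of rank $p_1$ and $I$ is of rank $m$, we have $m \le p_1$. Now let us show
that $m < p_1$ leads to a contradiction by showing that in this case there is
another vector satisfying the conditions. Since $A'\Sigma_{11}$ is $m \times p_1$, there exists a
$p_1 \times (p_1 - m)$ matrix $E$ (of rank $p_1 - m$) such that $A'\Sigma_{11}E = 0$. Similarly
there is a $p_2 \times (p_2 - m)$ matrix $F$ (of rank $p_2 - m$) such that $\Gamma_1'\Sigma_{22}F = 0$.
We also have $\Gamma_1'\Sigma_{21}E = \Lambda A'\Sigma_{11}E = 0$ and $A'\Sigma_{12}F = \Lambda\Gamma_1'\Sigma_{22}F = 0$. Since $E$
is of rank $p_1 - m$, $E'\Sigma_{11}E$ is nonsingular (if $m < p_1$), and similarly $F'\Sigma_{22}F$
is nonsingular. Thus there is at least one root of

(26)  $\begin{vmatrix} -\nu E'\Sigma_{11}E & E'\Sigma_{12}F \\ F'\Sigma_{21}E & -\nu F'\Sigma_{22}F \end{vmatrix} = 0,$

because $|E'\Sigma_{11}E| \cdot |F'\Sigma_{22}F| \ne 0$. From the preceding algebra we see that
there exist vectors $a$ and $b$ such that

(27)  $E'\Sigma_{12}Fb = \nu E'\Sigma_{11}Ea,$
(28)  $F'\Sigma_{21}Ea = \nu F'\Sigma_{22}Fb.$

Let $Ea = g$ and $Fb = h$. We now want to show that $\nu$, $g$, and $h$ form a
new solution $\lambda^{(m+1)}$, $\alpha^{(m+1)}$, $\gamma^{(m+1)}$. Let $\Sigma_{11}^{-1}\Sigma_{12}h = k$. Since
$A'\Sigma_{11}k = A'\Sigma_{12}Fb = 0$, $k$ is orthogonal to the rows of $A'\Sigma_{11}$ and therefore is a linear
combination of the columns of $E$, say $Ec$. Thus the equation $\Sigma_{12}h = \Sigma_{11}k$
can be written

(29)  $\Sigma_{12}Fb = \Sigma_{11}Ec.$

Multiplication by $E'$ on the left gives

(30)  $E'\Sigma_{12}Fb = E'\Sigma_{11}Ec.$

Since $E'\Sigma_{11}E$ is nonsingular, comparison of (27) and (30) shows that $c = \nu a$,
and therefore $k = \nu g$. Thus

(31)  $\Sigma_{12}h = \nu\Sigma_{11}g.$

In a similar fashion we show that

(32)  $\Sigma_{21}g = \nu\Sigma_{22}h.$

Therefore $\nu = \lambda^{(m+1)}$, $g = \alpha^{(m+1)}$, $h = \gamma^{(m+1)}$ is another solution. But this is
contrary to the assumption that $\lambda^{(m)}$, $\alpha^{(m)}$, $\gamma^{(m)}$ was the last possible solution.
Thus $m = p_1$.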

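The conditions on the $\lambda$'s, $\alpha$'s, and $\gamma$'s can be summarized as

(33)  $A'\Sigma_{11}A = I,$
(34)  $A'\Sigma_{12}\Gamma_1 = \Lambda,$
(35)  $\Gamma_1'\Sigma_{22}\Gamma_1 = I.$

Let $\Gamma_2 = (\gamma^{(p_1+1)} \cdots \gamma^{(p_2)})$ be a $p_2 \times (p_2 - p_1)$ matrix satisfying

(36)  $\Gamma_2'\Sigma_{22}\Gamma_1 = 0,$
(37)  $\Gamma_2'\Sigma_{22}\Gamma_2 = I.$

Any $\Gamma_2$ can be multiplied on the right by an arbitrary $(p_2 - p_1) \times (p_2 - p_1)$
orthogonal matrix. This matrix can be formed one column at a time: $\gamma^{(p_1+1)}$
is a vector orthogonal to $\Sigma_{22}\Gamma_1$ and normalized so $\gamma^{(p_1+1)\prime}\Sigma_{22}\gamma^{(p_1+1)} = 1$;
$\gamma^{(p_1+2)}$ is a vector orthogonal to $\Sigma_{22}(\Gamma_1\ \gamma^{(p_1+1)})$ and normalized so
$\gamma^{(p_1+2)\prime}\Sigma_{22}\gamma^{(p_1+2)} = 1$; and so forth. Let $\Gamma = (\Gamma_1\ \Gamma_2)$; this square matrix is
nonsingular since $\Gamma'\Sigma_{22}\Gamma = I$. Consider the determinant

(38)  $\left| \begin{pmatrix} A' & 0 \\ 0 & \Gamma_1' \\ 0 & \Gamma_2' \end{pmatrix} \begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix} \begin{pmatrix} A & 0 & 0 \\ 0 & \Gamma_1 & \Gamma_2 \end{pmatrix} \right|$

      $= \begin{vmatrix} -\lambda I & \Lambda & 0 \\ \Lambda & -\lambda I & 0 \\ 0 & 0 & -\lambda I \end{vmatrix}$

      $= (-\lambda)^{p_2 - p_1} \begin{vmatrix} -\lambda I & \Lambda \\ \Lambda & -\lambda I \end{vmatrix}$

      $= (-\lambda)^{p_2 - p_1}\, |-\lambda I| \cdot |-\lambda I - \Lambda(-\lambda I)^{-1}\Lambda|$

      $= (-\lambda)^{p_2 - p_1}\, |\lambda^2 I - \Lambda^2|$

      $= (-\lambda)^{p_2 - p_1} \prod_{i=1}^{p_1} \left(\lambda^2 - \lambda^{(i)2}\right).$

Except for a constant factor the above polynomial is

(39)  $\begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix}.$

Thus the roots of (14) are the roots of (38) set equal to zero, namely,
$\lambda = \pm\lambda^{(i)}$, $i = 1, \ldots, p_1$, and $\lambda = 0$ (of multiplicity $p_2 - p_1$). Thus
$(\lambda_1, \ldots, \lambda_p) = (\lambda_1, \ldots, \lambda_{p_1}, 0, \ldots, 0, -\lambda_{p_1}, \ldots, -\lambda_1)$. The set $\{\lambda^{(i)2}\}$, $i = 1, \ldots, p_1$, is the set
$\{\lambda_i^2\}$, $i = 1, \ldots, p_1$. To show that the set $\{\lambda^{(i)}\}$, $i = 1, \ldots, p_1$, is the set $\{\lambda_i\}$,
$i = 1, \ldots, p_1$, we only need to show that $\lambda^{(i)}$ is nonnegative (and therefore is
one of the $\lambda_i$, $i = 1, \ldots, p_1$). We observe that

(40)  $\Sigma_{12}\gamma^{(r)} = -\lambda^{(r)}\Sigma_{11}(-\alpha^{(r)}),$
(41)  $\Sigma_{21}(-\alpha^{(r)}) = -\lambda^{(r)}\Sigma_{22}\gamma^{(r)};$

thus, if $\lambda^{(r)}$, $\alpha^{(r)}$, $\gamma^{(r)}$ is a solution, so is $-\lambda^{(r)}$, $-\alpha^{(r)}$, $\gamma^{(r)}$. If $\lambda^{(r)}$ were
negative, then $-\lambda^{(r)}$ would be nonnegative and $-\lambda^{(r)} \ge \lambda^{(r)}$. But since $\lambda^{(r)}$
was to be maximum, we must have $\lambda^{(r)} \ge -\lambda^{(r)}$ and therefore $\lambda^{(r)} \ge 0$. Since
the set $\{\lambda^{(i)}\}$ is the same as $\{\lambda_i\}$, $i = 1, \ldots, p_1$, we must have $\lambda^{(i)} = \lambda_i$.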

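Let

(42)  $U = \begin{pmatrix} U_1 \\ \vdots \\ U_{p_1} \end{pmatrix} = A'X^{(1)},$

(43)  $V^{(1)} = \begin{pmatrix} V_1 \\ \vdots \\ V_{p_1} \end{pmatrix} = \Gamma_1'X^{(2)},$

(44)  $V^{(2)} = \begin{pmatrix} V_{p_1+1} \\ \vdots \\ V_{p_2} \end{pmatrix} = \Gamma_2'X^{(2)}.$

The components of $U$ are one set of canonical variates, and the components
of $V = (V^{(1)\prime}\ V^{(2)\prime})'$ are the other set. We have

(45)  $\mathcal{E}\begin{pmatrix} U \\ V^{(1)} \\ V^{(2)} \end{pmatrix}\left(U'\ V^{(1)\prime}\ V^{(2)\prime}\right) = \begin{pmatrix} A' & 0 \\ 0 & \Gamma_1' \\ 0 & \Gamma_2' \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} A & 0 & 0 \\ 0 & \Gamma_1 & \Gamma_2 \end{pmatrix} = \begin{pmatrix} I_{p_1} & \Lambda & 0 \\ \Lambda & I_{p_1} & 0 \\ 0 & 0 & I_{p_2 - p_1} \end{pmatrix},$

where

(46)  $\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{p_1} \end{pmatrix}.$


Definition 12.2.1. Let $X = (X^{(1)\prime}\ X^{(2)\prime})'$, where $X^{(1)}$ has $p_1$ components
and $X^{(2)}$ has $p_2$ $(= p - p_1 \ge p_1)$ components. The $r$th pair of canonical variates
is the pair of linear combinations $U_r = \alpha^{(r)\prime}X^{(1)}$ and $V_r = \gamma^{(r)\prime}X^{(2)}$, each of unit
variance and uncorrelated with the first $r - 1$ pairs of canonical variates and
having maximum correlation. The correlation is the $r$th canonical correlation.


Theorem 12.2.1. Let $X = (X^{(1)\prime}\ X^{(2)\prime})'$ be a random vector with covariance
matrix $\Sigma$. The $r$th canonical correlation between $X^{(1)}$ and $X^{(2)}$ is the $r$th largest
root of (14). The coefficients of $\alpha^{(r)\prime}X^{(1)}$ and $\gamma^{(r)\prime}X^{(2)}$ defining the $r$th pair of
canonical variates satisfy (13) for $\lambda = \lambda_r$ and (3) and (4).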


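We can now verify (without differentiation) that $U_1$, $V_1$ have maximum
correlation. The linear combinations $a'U = (a'A')X^{(1)}$ and $b'V = (b'\Gamma')X^{(2)}$
are normalized by $a'a = 1$ and $b'b = 1$. Since $A$ and $\Gamma$ are nonsingular, any
vector $\alpha$ can be written as $Aa$ and any vector $\gamma$ can be written as $\Gamma b$, and
hence any linear combinations $\alpha'X^{(1)}$ and $\gamma'X^{(2)}$ can be written as $a'U$ and
$b'V$. The correlation between them is

(47)  $a'(\Lambda\ \ 0)\,b = \sum_{i=1}^{p_1} \lambda_i a_i b_i.$

Let $\lambda_i a_i / \sqrt{\sum_j (\lambda_j a_j)^2} = c_i$. Then the maximum of

  $a'(\Lambda\ \ 0)\,b = \sqrt{\sum_i (\lambda_i a_i)^2}\; \sum_i c_i b_i$

with respect to $b$ is for $b_i = c_i$, since $\sum c_i b_i$ is the cosine of the angle between
the vector $b$ and $(c_1, \ldots, c_{p_1}, 0, \ldots, 0)'$. Then (47) is

  $\sqrt{\sum_{i=1}^{p_1} \lambda_i^2 a_i^2},$

and this is maximized by taking $a_i = 0$, $i = 2, \ldots, p_1$. Thus the maximized
linear combinations are $U_1$ and $V_1$. In verifying that $U_2$ and $V_2$ form the
second pair of canonical variates we note that lack of correlation between $U_1$
and a linear combination $a'U$ means $0 = \mathcal{E}U_1 a'U = \mathcal{E}U_1 \sum a_i U_i = a_1$ and lack
of correlation between $V_1$ and $b'V$ means $0 = b_1$. The algebra used above
gives the desired result with sums starting with $i = 2$.

We can derive a single matrix equation for $\alpha$ or $\gamma$. If we multiply (11) by $\lambda$
and (12) by $\Sigma_{22}^{-1}$, we have

(48)  $\lambda\Sigma_{12}\gamma = \lambda^2\Sigma_{11}\alpha,$
(49)  $\Sigma_{22}^{-1}\Sigma_{21}\alpha = \lambda\gamma.$

Substitution from (49) into (48) gives

(50)  $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\alpha = \lambda^2\Sigma_{11}\alpha$

or

(51)  $(\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda^2\Sigma_{11})\alpha = 0.$

The quantities $\lambda_1^2, \ldots, \lambda_{p_1}^2$ satisfy

(52)  $|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \nu\Sigma_{11}| = 0,$

and $\alpha^{(1)}, \ldots, \alpha^{(p_1)}$ satisfy (51) for $\lambda^2 = \lambda_1^2, \ldots, \lambda_{p_1}^2$, respectively. The similar
equations for $\gamma^{(1)}, \ldots, \gamma^{(p_2)}$ occur when $\lambda^2 = \lambda_1^2, \ldots, \lambda_{p_2}^2$ are substituted with

(53)  $(\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} - \lambda^2\Sigma_{22})\gamma = 0.$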
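Equations (51) and (52) suggest a direct numerical route: the squared canonical correlations are the characteristic roots of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. The following NumPy sketch illustrates this under the assumption that $\Sigma$ is positive definite; the function name `canonical_correlations` is hypothetical and the approach is a plain eigenvalue computation, not a method prescribed by the text.

```python
import numpy as np


def canonical_correlations(sigma, p1):
    """Canonical correlations and first-set coefficient vectors from (51)-(52).

    sigma : (p x p) positive definite covariance matrix
    p1    : number of components in the first set
    Returns (lam, alpha): correlations in decreasing order and a p1 x p1 matrix
    whose columns satisfy alpha' Sigma_11 alpha = 1, as in (3).
    """
    sigma = np.asarray(sigma, dtype=float)
    s11, s12 = sigma[:p1, :p1], sigma[:p1, p1:]
    s21, s22 = sigma[p1:, :p1], sigma[p1:, p1:]
    # M = Sigma_11^{-1} Sigma_12 Sigma_22^{-1} Sigma_21; its roots solve (52).
    m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s21))
    nu, alpha = np.linalg.eig(m)
    nu, alpha = nu.real, alpha.real
    order = np.argsort(nu)[::-1]
    nu, alpha = nu[order], alpha[:, order]
    # Rescale each column so that alpha' Sigma_11 alpha = 1.
    alpha = alpha / np.sqrt(np.sum(alpha * (s11 @ alpha), axis=0))
    lam = np.sqrt(np.clip(nu, 0.0, None))      # nonnegative roots
    return lam, alpha
```

The signs of the columns of `alpha` are arbitrary here; a sign convention would have to be imposed separately if unique vectors are needed.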


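Theorem 12.2.2. The canonical correlations are invariant with respect to
transformations $X^{(i)*} = C_i X^{(i)}$, where $C_i$ is nonsingular, $i = 1, 2$, and any
function of $\Sigma$ that is invariant is a function of the canonical correlations.


Proof. Equation (14) is transformed to

(54)  $0 = \begin{vmatrix} -\lambda C_1\Sigma_{11}C_1' & C_1\Sigma_{12}C_2' \\ C_2\Sigma_{21}C_1' & -\lambda C_2\Sigma_{22}C_2' \end{vmatrix} = \begin{vmatrix} C_1 & 0 \\ 0 & C_2 \end{vmatrix} \cdot \begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix} \cdot \begin{vmatrix} C_1' & 0 \\ 0 & C_2' \end{vmatrix},$

and hence the roots are unchanged. Conversely, let $f(\Sigma_{11}, \Sigma_{12}, \Sigma_{22})$ be a
vector-valued function of $\Sigma$ such that $f(C_1\Sigma_{11}C_1', C_1\Sigma_{12}C_2', C_2\Sigma_{22}C_2') =
f(\Sigma_{11}, \Sigma_{12}, \Sigma_{22})$ for all nonsingular $C_1$ and $C_2$. If $C_1 = A'$ and $C_2 = \Gamma'$, then
(54) is (38), which depends only on the canonical correlations. Then
$f = f\{I, (\Lambda, 0), I\}$.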
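The factorization in (54) can be checked numerically: for a block-diagonal transformation, the determinant in (14) changes only by the constant factor $|C_1|^2 |C_2|^2$, so its roots in $\lambda$ are unchanged. The NumPy sketch below is illustrative; the helper names `lambda_matrix` and `block_transform` and the particular matrices are assumptions made for the example.

```python
import numpy as np


def lambda_matrix(sigma, p1, lam):
    """Matrix of (13)-(14): diagonal blocks multiplied by -lam, other blocks kept."""
    out = np.array(sigma, dtype=float)
    out[:p1, :p1] *= -lam
    out[p1:, p1:] *= -lam
    return out


def block_transform(sigma, c1, c2):
    """Covariance matrix of (C1 X1, C2 X2) together with the block-diagonal C."""
    p1, p2 = c1.shape[0], c2.shape[0]
    c = np.zeros((p1 + p2, p1 + p2))
    c[:p1, :p1] = c1
    c[p1:, p1:] = c2
    return c @ np.asarray(sigma, dtype=float) @ c.T, c


sigma = np.array([[1.0, 0.2, 0.6, 0.1],
                  [0.2, 1.0, 0.1, 0.3],
                  [0.6, 0.1, 1.0, 0.2],
                  [0.1, 0.3, 0.2, 1.0]])
c1 = np.array([[2.0, 1.0], [0.0, 1.0]])
c2 = np.array([[1.0, 0.0], [3.0, -1.0]])
sigma_star, c = block_transform(sigma, c1, c2)

lam = 0.4                                    # any trial value of lambda
lhs = np.linalg.det(lambda_matrix(sigma_star, 2, lam))
rhs = np.linalg.det(c) ** 2 * np.linalg.det(lambda_matrix(sigma, 2, lam))
```

Here `lhs` and `rhs` agree up to rounding for every trial value of `lam`, which is the content of (54).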


We can make another interpretation of these developments in terms of
prediction. Consider two random variables $U$ and $V$ with means 0 and
variances $\sigma_u^2$ and $\sigma_v^2$ and correlation $\rho$. Consider approximating $U$ by a
multiple of $V$, say $bV$; then the mean squared error of approximation is

(55)  $\mathcal{E}(U - bV)^2 = \sigma_u^2 - 2b\sigma_u\sigma_v\rho + b^2\sigma_v^2$
      $= \sigma_u^2(1 - \rho^2) + (b\sigma_v - \rho\sigma_u)^2.$

This is minimized by taking $b = \sigma_u\rho/\sigma_v$. We can consider $bV$ as a linear
prediction of $U$ from $V$; then $\sigma_u^2(1 - \rho^2)$ is the mean squared error of
prediction. The ratio of the mean squared error of prediction to the variance
of $U$ is $\sigma_u^2(1 - \rho^2)/\sigma_u^2 = 1 - \rho^2$; the complement is a measure of the relative
effect of $V$ on $U$ or the relative effectiveness of $V$ in predicting $U$. Thus the
greater $\rho^2$ or $|\rho|$ is, the more effective is $V$ in predicting $U$.
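The minimization in (55) is easy to confirm numerically. The short sketch below evaluates the mean squared error for a given multiplier and checks it at the minimizing value $b = \sigma_u\rho/\sigma_v$; the function names `mse` and `best_multiplier` are hypothetical and the numbers are chosen only for illustration.

```python
import math


def mse(b, var_u, var_v, rho):
    """Mean squared error E(U - bV)^2 from the first line of (55)."""
    su, sv = math.sqrt(var_u), math.sqrt(var_v)
    return var_u - 2.0 * b * su * sv * rho + b * b * var_v


def best_multiplier(var_u, var_v, rho):
    """Minimizing multiplier b = sigma_u * rho / sigma_v."""
    return math.sqrt(var_u) * rho / math.sqrt(var_v)


b_star = best_multiplier(4.0, 9.0, 0.5)      # sigma_u = 2, sigma_v = 3
min_mse = mse(b_star, 4.0, 9.0, 0.5)         # equals sigma_u^2 (1 - rho^2)
```

With $\sigma_u^2 = 4$, $\sigma_v^2 = 9$, and $\rho = 0.5$, the multiplier is $1/3$ and the minimum mean squared error is $4(1 - 0.25) = 3$.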

Now consider the random vector $X$ partitioned according to (1), and
consider using a linear combination $V = \gamma'X^{(2)}$ to predict a linear
combination $U = \alpha'X^{(1)}$. Then $V$ predicts $U$ best if the correlation between $U$ and $V$
is a maximum. Thus we can say that $\alpha^{(1)\prime}X^{(1)}$ is the linear combination of
$X^{(1)}$ that can be predicted best, and $\gamma^{(1)\prime}X^{(2)}$ is the best predictor [Hotelling
(1935)].

The mean squared effect of $V$ on $U$ can be measured as

(56)  $\mathcal{E}(bV)^2 = \rho^2\,\frac{\sigma_u^2}{\sigma_v^2}\,\mathcal{E}V^2 = \rho^2\sigma_u^2,$

and the relative mean squared effect can be measured by the ratio
$\mathcal{E}(bV)^2/\mathcal{E}U^2 = \rho^2$. Thus maximum effect of a linear combination of $X^{(2)}$ on
a linear combination of $X^{(1)}$ is made by $\gamma^{(1)\prime}X^{(2)}$ on $\alpha^{(1)\prime}X^{(1)}$.

In the special case of $p_1 = 1$, the one canonical correlation is the multiple
correlation between $X^{(1)} = X_1$ and $X^{(2)}$.

The definition of canonical variates and correlations was made in terms of
the covariance matrix $\Sigma = \mathcal{E}(X - \mathcal{E}X)(X - \mathcal{E}X)'$. We could extend this
treatment by starting with a normally distributed vector $Y$ with $p + p_3$
components and define $X$ as the vector having the conditional distribution of
the first $p$ components of $Y$ given the value of the last $p_3$ components. This
would mean treating $X_\phi$ with mean $\mathcal{E}X_\phi = \Theta y_\phi^{(3)}$; the elements of the
covariance matrix would be the partial covariances of the first $p$ elements
of $Y$.

The interpretation of canonical variates may be facilitated by considering
the correlations between the canonical variates and the components of the
original vectors [e.g., Darlington, Weinberg, and Wahlberg (1973)]. The
covariance between the $j$th canonical variate $U_j$ and $X_i$ is

(57)  $\mathcal{E}U_j X_i = \mathcal{E}\sum_{k=1}^{p_1} \alpha_k^{(j)} X_k X_i = \sum_{k=1}^{p_1} \alpha_k^{(j)} \sigma_{ki}.$

Since the variance of $U_j$ is 1, the correlation between $U_j$ and $X_i$ is

(58)  $\operatorname{Corr}(U_j, X_i) = \frac{\sum_{k=1}^{p_1} \alpha_k^{(j)} \sigma_{ki}}{\sqrt{\sigma_{ii}}}.$

An advantage of this measure is that it does not depend on the units of
measurement of $X_i$. However, it is not a scalar multiple of the weight of $X_i$
in $U_j$ (namely, $\alpha_i^{(j)}$).
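The correlations in (58) can be computed for all components and all canonical variates at once. The sketch below assumes the coefficient vectors are supplied as the columns of a matrix normalized as in (3); the function name `structure_correlations` is hypothetical.

```python
import numpy as np


def structure_correlations(s11, alpha):
    """Correlations between the components X_i of X^(1) and the variates U_j.

    s11   : covariance matrix Sigma_11 of the first set
    alpha : matrix whose jth column is alpha^(j), with alpha' Sigma_11 alpha = 1
    Entry (i, j) of the result is Corr(U_j, X_i) as in (58).
    """
    s11 = np.asarray(s11, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    cov = s11 @ alpha                          # entry (i, j) is sum_k sigma_ik alpha_k^(j)
    return cov / np.sqrt(np.diag(s11))[:, None]
```

Rescaling a component $X_i$ changes $\sigma_{ii}$ and the corresponding weight in $\alpha^{(j)}$, but leaves these correlations unchanged, in line with the remark above about units of measurement.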

A special case is $\Sigma_{11} = I$, $\Sigma_{22} = I$. Then

(59)  $A'A = I, \qquad \Gamma'\Gamma = I, \qquad A'\Sigma_{12}\Gamma = (\Lambda\ \ 0).$

From these we obtain

(60)  $\Sigma_{12} = A\,(\Lambda\ \ 0)\,\Gamma',$

where $A$ and $\Gamma$ are orthogonal and $\Lambda$ is diagonal. This relationship is known
as the singular value decomposition of $\Sigma_{12}$. The elements of $\Lambda$ are the square
roots of the characteristic roots of $\Sigma_{12}\Sigma_{21}$, and the columns of $A$ are
characteristic vectors. The diagonal elements of $\Lambda$ are square roots of the
(possibly nonzero) roots of $\Sigma_{21}\Sigma_{12}$, and the columns of $\Gamma$ are the
characteristic vectors.
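The special case (60) extends to general $\Sigma_{11}$ and $\Sigma_{22}$ by first transforming each set to identity covariance. The sketch below does this with Cholesky factors and then applies a singular value decomposition; this whitening step is a standard numerical device assumed here for illustration, not a construction given in the text, and the function name `canonical_svd` is hypothetical.

```python
import numpy as np


def canonical_svd(sigma, p1):
    """Canonical correlations and coefficient matrices via a whitened SVD.

    Returns (a, lam, g) with a' Sigma_11 a = I, g' Sigma_22 g = I, and
    a' Sigma_12 g = (diag(lam) 0), matching (33)-(37) and (45).
    """
    sigma = np.asarray(sigma, dtype=float)
    s11, s12, s22 = sigma[:p1, :p1], sigma[:p1, p1:], sigma[p1:, p1:]
    l1 = np.linalg.cholesky(s11)               # s11 = l1 @ l1.T
    l2 = np.linalg.cholesky(s22)               # s22 = l2 @ l2.T
    k = np.linalg.solve(l1, s12)               # l1^{-1} s12
    k = np.linalg.solve(l2, k.T).T             # l1^{-1} s12 l2^{-T}
    u, lam, vt = np.linalg.svd(k)              # k = u (diag(lam) 0) vt
    a = np.linalg.solve(l1.T, u)               # A = l1^{-T} U
    g = np.linalg.solve(l2.T, vt.T)            # Gamma = l2^{-T} V
    return a, lam, g
```

When $\Sigma_{11} = I$ and $\Sigma_{22} = I$ the Cholesky factors are identity matrices and the function reduces to the singular value decomposition of $\Sigma_{12}$ described above.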


12.3. ESTIMATION OF CANONICAL CORRELATIONS 
AND VARIATES 


12.3.1. Estimation 


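Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Sigma)$. Let $x_\alpha$ be partitioned into
two subvectors of $p_1$ and $p_2$ components, respectively,

(1) $x_\alpha = \begin{pmatrix} x_\alpha^{(1)} \\ x_\alpha^{(2)} \end{pmatrix}, \qquad \alpha = 1, \ldots, N.$

The maximum likelihood estimator of $\Sigma$ [partitioned as in (2) of Section 12.2]
is

(2) $\hat\Sigma = \begin{pmatrix} \hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & \hat\Sigma_{22} \end{pmatrix} = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)'$

$\qquad = \dfrac{1}{N} \begin{pmatrix} \sum (x_\alpha^{(1)} - \bar x^{(1)})(x_\alpha^{(1)} - \bar x^{(1)})' & \sum (x_\alpha^{(1)} - \bar x^{(1)})(x_\alpha^{(2)} - \bar x^{(2)})' \\ \sum (x_\alpha^{(2)} - \bar x^{(2)})(x_\alpha^{(1)} - \bar x^{(1)})' & \sum (x_\alpha^{(2)} - \bar x^{(2)})(x_\alpha^{(2)} - \bar x^{(2)})' \end{pmatrix}.$

The maximum likelihood estimators of the canonical correlations $\Lambda$ and the
canonical variates defined by $A$ and $\Gamma$ involve applying the algebra of the
previous section to $\hat\Sigma$. The matrices $\Lambda$, $A$, and $\Gamma_1$ are uniquely defined if we
assume the canonical correlations different and that the first nonzero
element of each column of $A$ is positive. The indeterminacy in $\Gamma_2$ allows
multiplication on the right by a $(p_2 - p_1) \times (p_2 - p_1)$ orthogonal matrix; this
indeterminacy can be removed by various types of requirements, for example,
that the submatrix formed by the lower $p_2 - p_1$ rows be upper or lower
triangular with positive diagonal elements. Application of Corollary 3.2.1
then shows that the maximum likelihood estimators of $\lambda_1, \ldots, \lambda_{p_1}$ are the
roots of

(3) $\begin{vmatrix} -l\hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & -l\hat\Sigma_{22} \end{vmatrix} = 0,$

and the jth columns of $\hat A$ and $\hat\Gamma_1$ satisfy

(4) $\begin{pmatrix} -l_j\hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & -l_j\hat\Sigma_{22} \end{pmatrix} \begin{pmatrix} \hat\alpha^{(j)} \\ \hat\gamma^{(j)} \end{pmatrix} = 0,$

(5) $\hat\alpha^{(j)\prime}\hat\Sigma_{11}\hat\alpha^{(j)} = 1, \quad \hat\gamma^{(j)\prime}\hat\Sigma_{22}\hat\gamma^{(j)} = 1.$

$\hat\Gamma_2$ satisfies

(6) $\hat\Gamma_2'\hat\Sigma_{22}\hat\Gamma_1 = 0,$
(7) $\hat\Gamma_2'\hat\Sigma_{22}\hat\Gamma_2 = I.$

When the other restrictions on $\hat\Gamma_2$ are made, $\hat A$, $\hat\Gamma$, and $\hat\Lambda$ are uniquely
defined.
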
Theorem 12.3.1. Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Sigma)$. Let $\Sigma$ be
partitioned into $p_1$ and $p_2$ ($p_1 \le p_2$) rows and columns as in (2) in Section 12.2,
and let $x_\alpha$ be similarly partitioned as in (1). The maximum likelihood estimators
of the canonical correlations are the roots of (3), where $\hat\Sigma_{ij}$ are defined by (2).
The maximum likelihood estimators of the coefficients of the jth canonical
components satisfy (4) and (5), $j = 1, \ldots, p_1$; the remaining components satisfy
(6) and (7).


In the population the canonical correlations and canonical variates were
found in terms of maximizing correlations of linear combinations of two sets
of variates. The entire argument can be carried out in terms of the sample.
Thus $\hat\alpha^{(1)\prime}x_\alpha^{(1)}$ and $\hat\gamma^{(1)\prime}x_\alpha^{(2)}$ have maximum sample correlation between any
linear combinations of $x_\alpha^{(1)}$ and $x_\alpha^{(2)}$, and this correlation is $l_1$. Similarly,
$\hat\alpha^{(2)\prime}x_\alpha^{(1)}$ and $\hat\gamma^{(2)\prime}x_\alpha^{(2)}$ have the second maximum sample correlation, and so
forth.

It may also be observed that we could define the sample canonical variates
and correlations in terms of $S$, the unbiased estimator of $\Sigma$. Then
$a^{(j)} = \sqrt{(N-1)/N}\,\hat\alpha^{(j)}$, $c^{(j)} = \sqrt{(N-1)/N}\,\hat\gamma^{(j)}$, and $l_j$ satisfy

(8) $S_{12}c^{(j)} = l_j S_{11}a^{(j)},$
(9) $S_{21}a^{(j)} = l_j S_{22}c^{(j)},$
(10) $a^{(j)\prime}S_{11}a^{(j)} = 1, \quad c^{(j)\prime}S_{22}c^{(j)} = 1.$

We shall call the linear combinations $a^{(j)\prime}x_\alpha^{(1)}$ and $c^{(j)\prime}x_\alpha^{(2)}$ the sample
canonical variates.


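We can also derive the sample canonical variates from the sample correlation
matrix,

(11) $R = \left( \dfrac{s_{ij}}{\sqrt{s_{ii}s_{jj}}} \right) = (r_{ij}) = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}.$

Let

(12) $S_1 = \mathrm{diag}\left(\sqrt{s_{11}}, \ldots, \sqrt{s_{p_1 p_1}}\right),$
(13) $S_2 = \mathrm{diag}\left(\sqrt{s_{p_1+1,\,p_1+1}}, \ldots, \sqrt{s_{pp}}\right).$

Then we can write (8) through (10) as

(14) $R_{12}(S_2c^{(j)}) = l_j R_{11}(S_1a^{(j)}),$
(15) $R_{21}(S_1a^{(j)}) = l_j R_{22}(S_2c^{(j)}),$
(16) $(S_1a^{(j)})'R_{11}(S_1a^{(j)}) = 1, \quad (S_2c^{(j)})'R_{22}(S_2c^{(j)}) = 1.$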


We can give these developments a geometric interpretation. The rows of
the matrix $(x_1, \ldots, x_N)$ can be interpreted as $p$ vectors in an $N$-dimensional
space, and the rows of $(x_1 - \bar x, \ldots, x_N - \bar x)$ are the $p$ vectors projected on the
$(N-1)$-dimensional subspace orthogonal to the equiangular line. Denote
these as $x_1^*, \ldots, x_p^*$. Any vector $u^*$ with components
$\alpha'(x_1^{(1)} - \bar x^{(1)}, \ldots, x_N^{(1)} - \bar x^{(1)}) = \alpha_1 x_1^* + \cdots + \alpha_{p_1} x_{p_1}^*$ is in the $p_1$-space spanned
by $x_1^*, \ldots, x_{p_1}^*$, and a vector $v^*$ with components
$\gamma'(x_1^{(2)} - \bar x^{(2)}, \ldots, x_N^{(2)} - \bar x^{(2)}) = \gamma_1 x_{p_1+1}^* + \cdots + \gamma_{p_2} x_p^*$ is in the $p_2$-space spanned
by $x_{p_1+1}^*, \ldots, x_p^*$. The cosine of the
angle between these two vectors is the correlation between $u_\alpha = \alpha'x_\alpha^{(1)}$ and
$v_\alpha = \gamma'x_\alpha^{(2)}$, $\alpha = 1, \ldots, N$. Finding $\alpha$ and $\gamma$ to maximize the correlation is
equivalent to finding the vectors in the $p_1$-space and the $p_2$-space such that
the angle between them is least (i.e., has the greatest cosine). This gives the
first canonical variates, and the first canonical correlation is the cosine of the
angle. Similarly, the second canonical variates correspond to vectors orthogonal
to the first canonical variates and with the angle minimized.


12.3.2. Computation 


We shall discuss briefly computation in terms of the population quantities.
Equations (50), (51), or (52) of Section 12.2 can be used. The computation of
$\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ can be accomplished by solving $\Sigma_{21} = \Sigma_{22}F$ for $F = \Sigma_{22}^{-1}\Sigma_{21}$ and
then multiplying by $\Sigma_{12}$. If $p_1$ is sufficiently small, the determinant
$|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \nu\Sigma_{11}|$ can be expanded into a polynomial in $\nu$, and the
polynomial equation may be solved for $\nu$. The solutions are then inserted
into (51) to arrive at the vectors $\alpha$.

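In many cases $p_1$ is too large for this procedure to be efficient. Then one
can use an iterative procedure

(17) $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\,\alpha(i) = \lambda^2(i+1)\,\alpha(i+1),$

starting with an initial approximation $\alpha(0)$; the vector $\alpha(i+1)$ may be
normalized by

(18) $\alpha(i+1)'\Sigma_{11}\alpha(i+1) = 1.$

Then $\lambda^2(i+1)$ converges to $\lambda_1^2$ and $\alpha(i+1)$ converges to $\alpha^{(1)}$ (if $\lambda_1 > \lambda_2$).
This can be demonstrated in a fashion similar to that used for principal
components, using

(19) $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = A\Lambda^2A^{-1}$

of Section 12.2. See Problem 12.9.

The right-hand side of (19) is $\sum_{i=1}^{p_1} \alpha^{(i)}\lambda_i^2\tilde\alpha^{(i)\prime}$, where $\tilde\alpha^{(i)\prime}$ is the ith row of
$A^{-1}$. From the fact that $A'\Sigma_{11}A = I$, we find that $A'\Sigma_{11} = A^{-1}$ and thus
$\alpha^{(i)\prime}\Sigma_{11} = \tilde\alpha^{(i)\prime}$. Now

(20) $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda_1^2\alpha^{(1)}\alpha^{(1)\prime}\Sigma_{11} = \sum_{i=2}^{p_1} \alpha^{(i)}\lambda_i^2\alpha^{(i)\prime}\Sigma_{11}$

$\qquad = A\,\mathrm{diag}(0, \lambda_2^2, \ldots, \lambda_{p_1}^2)\,A^{-1}.$

The maximum characteristic root of this matrix is $\lambda_2^2$. If we use this
matrix for iteration, we will obtain $\lambda_2^2$ and $\alpha^{(2)}$. The procedure is continued
to find as many $\lambda_i^2$ and $\alpha^{(i)}$ as desired.

Given $\lambda_i$ and $\alpha^{(i)}$, we find $\gamma^{(i)}$ from $\Sigma_{21}\alpha^{(i)} = \lambda_i\Sigma_{22}\gamma^{(i)}$ or equivalently
$(1/\lambda_i)\Sigma_{22}^{-1}\Sigma_{21}\alpha^{(i)} = \gamma^{(i)}$. A check on the computations is provided by
comparing $\Sigma_{12}\gamma^{(i)}$ and $\lambda_i\Sigma_{11}\alpha^{(i)}$.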
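
The iteration (17)-(18), the deflation step (20), and the closing check can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `canonical_by_iteration`, the fixed iteration count, and the starting vector of ones are choices made here, and the covariance blocks are the sample values of Section 12.5 standing in for the population matrices.

```python
import numpy as np

def canonical_by_iteration(S11, S12, S22, n_roots, n_iter=200):
    """Power iteration (17)-(18) followed by the deflation step (20)."""
    # M = S11^{-1} S12 S22^{-1} S21, formed by solving linear systems.
    M = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S12.T))
    roots, vectors = [], []
    for _ in range(n_roots):
        a = np.ones(S11.shape[0])              # initial approximation alpha(0)
        for _ in range(n_iter):
            b = M @ a                          # left-hand side of (17)
            a = b / np.sqrt(b @ S11 @ b)       # normalization (18)
        lam2 = a @ S11 @ (M @ a)               # current root, using a' S11 a = 1
        roots.append(lam2)
        vectors.append(a)
        M = M - lam2 * np.outer(a, a) @ S11    # remove the root found, as in (20)
    return np.array(roots), np.column_stack(vectors)

# Covariance blocks taken from the example of Section 12.5 (illustrative input).
S11 = np.array([[95.2933, 52.8683], [52.8683, 54.3600]])
S12 = np.array([[69.6617, 46.1117], [51.3117, 35.0533]])
S22 = np.array([[100.8067, 56.5400], [56.5400, 45.0233]])

roots, A = canonical_by_iteration(S11, S12, S22, n_roots=2)
lam = np.sqrt(roots)
G = np.linalg.solve(S22, S12.T @ A) / lam      # gamma^(i) = (1/lambda_i) S22^{-1} S21 alpha^(i)
# Check suggested in the text: S12 gamma^(i) should agree with lambda_i S11 alpha^(i).
check = np.abs(S12 @ G - (S11 @ A) * lam).max()
```

The columns of `A` play the role of the vectors $\alpha^{(i)}$ and `roots` holds the squared correlations; `check` should be negligible relative to the size of the entries.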


For the sample we perform these calculations with $\hat\Sigma_{ij}$ or $S_{ij}$ substituted
for $\Sigma_{ij}$. It is often convenient to use $R_{ij}$ in the computation (because
$-1 < r_{ij} < 1$) to obtain $S_1a^{(j)}$ and $S_2c^{(j)}$; from these $a^{(j)}$ and $c^{(j)}$ can be
computed.


Modern computational procedures are available for canonical correlations
and variates similar to those sketched for principal components. Let

(21) $Z_1 = (x_1^{(1)} - \bar x^{(1)}, \ldots, x_N^{(1)} - \bar x^{(1)}),$
(22) $Z_2 = (x_1^{(2)} - \bar x^{(2)}, \ldots, x_N^{(2)} - \bar x^{(2)}).$

The QR decomposition of the transpose of these matrices (Section 11.4) is
$Z_i' = Q_iR_i$, where $Q_i'Q_i = I_{p_i}$ and $R_i$ is upper triangular. Then $S_{ij} = Z_iZ_j' =
R_i'Q_i'Q_jR_j$, $i, j = 1, 2$, and $S_{ii} = R_i'R_i$, $i = 1, 2$. The canonical correlations are
the singular values of $Q_1'Q_2$ and the square roots of the characteristic roots of
$(Q_1'Q_2)(Q_2'Q_1)$ (by Theorem 12.2.2). Then the singular value decomposition
of $Q_1'Q_2$ is $P(L \;\; 0)T$, where $P$ and $T$ are orthogonal and $L$ is diagonal. To
effect the decomposition Householder transformations are applied to the left
and right of $Q_1'Q_2$ to obtain an upper bidiagonal matrix, that is, a matrix with
entries on the main diagonal and first superdiagonal. Givens matrices are
used to reduce this matrix to a matrix that is diagonal to the degree of
approximation required. For more detail see Kennedy and Gentle (1980),
Sections 7.2 and 12.2, Chambers (1977), Björck and Golub (1973), Golub and
Luk (1976), and Golub and Van Loan (1989).
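
As a rough illustration of the QR and singular value route just described, the following NumPy sketch leaves the Householder and Givens steps to the library routines `numpy.linalg.qr` and `numpy.linalg.svd`. The function name `cancorr_qr`, the simulated data, and the scaling of the coefficient vectors (so that $a'S_{11}a = c'S_{22}c = 1$ with divisor $N - 1$) are assumptions made for the example.

```python
import numpy as np

def cancorr_qr(X1, X2):
    """Canonical correlations from data matrices whose rows are observations.

    X1 is N x p1 and X2 is N x p2 with p1 <= p2; after centering they are the
    transposes of Z_1 and Z_2 in (21) and (22).
    """
    N, p1 = X1.shape
    Z1t = X1 - X1.mean(axis=0)
    Z2t = X2 - X2.mean(axis=0)
    Q1, R1 = np.linalg.qr(Z1t)                 # Z_1' = Q_1 R_1 with Q_1'Q_1 = I
    Q2, R2 = np.linalg.qr(Z2t)
    P, L, T = np.linalg.svd(Q1.T @ Q2)         # Q_1'Q_2 = P (L 0) T
    # Coefficient vectors, scaled so that a'S11 a = c'S22 c = 1 (divisor N - 1).
    A = np.sqrt(N - 1) * np.linalg.solve(R1, P)
    C = np.sqrt(N - 1) * np.linalg.solve(R2, T.T[:, :p1])
    return L, A, C

# Simulated data, used only to exercise the function.
rng = np.random.default_rng(0)
X2 = rng.normal(size=(200, 3))
X1 = X2[:, :2] @ np.array([[1.0, 0.5], [0.0, 1.0]]) + rng.normal(size=(200, 2))
L, A, C = cancorr_qr(X1, X2)
```

Working with $Q_1'Q_2$ avoids forming and inverting $S_{11}$ and $S_{22}$ explicitly, which is the usual numerical argument for this approach.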


12.4. STATISTICAL INFERENCE 


12.4.1. Tests of Independence and of Rank 


In Chapter 9 we considered testing the null hypothesis that $X^{(1)}$ and $X^{(2)}$ are
independent, which is equivalent to the null hypothesis that $\Sigma_{12} = 0$. Since
$A'\Sigma_{12}\Gamma = (\Lambda \;\; 0)$, it is seen that the hypothesis is equivalent to $\Lambda = 0$, that is,
$\rho_1 = \cdots = \rho_{p_1} = 0$. The likelihood ratio criterion for testing this null hypothesis
is the $N/2$ power of

(1) $\prod_{i=1}^{p_1} (1 - r_i^2),$

where $r_1 = l_1 \ge \cdots \ge r_{p_1} = l_{p_1} \ge 0$ are the $p_1$ possibly nonzero sample canonical
correlations. Under the null hypothesis, the limiting distribution of
Bartlett's modification of $-2$ times the logarithm of the likelihood ratio
criterion, namely,

(2) $-\left[N - \tfrac{1}{2}(p + 3)\right] \sum_{i=1}^{p_1} \log(1 - r_i^2),$

is $\chi^2$ with $p_1p_2$ degrees of freedom. (See Section 9.4.) Note that it is
approximately

(3) $N \sum_{i=1}^{p_1} r_i^2 = N \,\mathrm{tr}\, A_{11}^{-1}A_{12}A_{22}^{-1}A_{21},$

which is $N$ times Nagao's criterion [(2) of Section 9.5].

If $\Sigma_{12} \ne 0$, an interesting question is how many population canonical
correlations are different from 0; that is, how many canonical variates are
needed to explain the correlations between $X^{(1)}$ and $X^{(2)}$? The number of
nonzero canonical correlations is equal to the rank of $\Sigma_{12}$. The likelihood
ratio criterion for testing the null hypothesis $H_k\colon \rho_{k+1} = \cdots = \rho_{p_1} = 0$, that is,
that the rank of $\Sigma_{12}$ is not greater than $k$, is $\prod_{i=k+1}^{p_1} (1 - r_i^2)^{N/2}$ [Fujikoshi
(1974)]. Under the null hypothesis

(4) $-\left[N - \tfrac{1}{2}(p + 3)\right] \sum_{i=k+1}^{p_1} \log(1 - r_i^2)$

has approximately the $\chi^2$-distribution with $(p_1 - k)(p_2 - k)$ degrees of freedom.
[Glynn and Muirhead (1978) suggest multiplying the sum in (4) by
$N - k - \tfrac{1}{2}(p + 3) + \sum_{i=1}^{k} (1/r_i^2)$; see also Lawley (1959).]
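
The statistic (4) is simple to evaluate once the sample canonical correlations are available. The sketch below is illustrative only: the function name `rank_test_statistic` is hypothetical, and applying it to the two sample correlations reported in Section 12.5 ($N = 25$, $p_1 = p_2 = 2$) is a choice made here rather than a computation taken from the text.

```python
import math

def rank_test_statistic(r, N, p1, p2, k=0):
    """Statistic (4) for H_k: rho_{k+1} = ... = rho_{p1} = 0.

    r lists the sample canonical correlations in decreasing order.
    Returns the statistic and its approximate chi-square degrees of freedom.
    """
    p = p1 + p2
    factor = N - 0.5 * (p + 3)
    stat = -factor * sum(math.log(1.0 - ri ** 2) for ri in r[k:p1])
    return stat, (p1 - k) * (p2 - k)

# Sample canonical correlations reported in Section 12.5.
r = [0.788553, 0.053852]
stat0, df0 = rank_test_statistic(r, N=25, p1=2, p2=2, k=0)   # all correlations zero
stat1, df1 = rank_test_statistic(r, N=25, p1=2, p2=2, k=1)   # rank at most one
```

Each statistic would then be compared with an upper percentage point of $\chi^2$ on the returned degrees of freedom (for instance 9.488 for 4 degrees of freedom and 3.841 for 1 degree of freedom at the 5% level).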

To determine the numbers of nonzero and zero population canonical
correlations one can test that all the roots are 0; if that hypothesis is rejected,
test that the $p_1 - 1$ smallest roots are 0; etc. Of course, these procedures are
not statistically independent, even asymptotically. Alternatively, one could
use a sequence of tests in the opposite direction: Test $\rho_{p_1} = 0$, then
$\rho_{p_1-1} = \rho_{p_1} = 0$, and so on, until a hypothesis is rejected or until $\Sigma_{12} = 0$ is accepted.
Yet another procedure (which can only be carried out for small $p_1$) is to test
$\rho_{p_1} = 0$, then $\rho_{p_1-1} = 0$, and so forth. In this procedure one would use $r_j$ to
test the hypothesis $\rho_j = 0$. The relevant asymptotic distribution will be
discussed in Section 12.4.2.


12.4.2. Distributions of Canonical Correlations


The density of the canonical correlations is given in Section 13.4 for the case
that $\Sigma_{12} = 0$, that is, all the population correlations are 0. The density when
some population correlations are different from 0 has been given by Constantine
(1963) in terms of a hypergeometric function of two matrix arguments.

The large-sample theory is more manageable. Suppose the first $k$ canonical
correlations are positive, less than 1, and different, and suppose that
$p_1 - k$ correlations are 0. Let

(5) $z_i = \sqrt{N}\,\dfrac{r_i^2 - \rho_i^2}{2\rho_i(1 - \rho_i^2)}, \quad i = 1, \ldots, k; \qquad z_i = N r_i^2, \quad i = k+1, \ldots, p_1.$

Then in the limiting distribution $z_1, \ldots, z_k$ and the set $z_{k+1}, \ldots, z_{p_1}$ are
mutually independent, $z_i$ has the limiting distribution $N(0, 1)$, $i = 1, \ldots, k$,
and the density of the limiting distribution of $z_{k+1}, \ldots, z_{p_1}$ is

(6) $\dfrac{\pi^{\frac{1}{2}(p_1-k)^2}}{2^{\frac{1}{2}(p_1-k)(p_2-k)}\,\Gamma_{p_1-k}\left[\tfrac{1}{2}(p_1-k)\right]\Gamma_{p_1-k}\left[\tfrac{1}{2}(p_2-k)\right]} \exp\left(-\tfrac{1}{2}\sum_{i=k+1}^{p_1} z_i\right) \prod_{i=k+1}^{p_1} z_i^{\frac{1}{2}(p_2-p_1-1)} \prod_{\substack{i,j=k+1 \\ i<j}}^{p_1} (z_i - z_j).$

This is the density (11) of Section 13.3 of the characteristic roots of a
$(p_1 - k)$-order matrix with distribution $W(I_{p_1-k}, p_2 - k)$. Note that the
normalizing factor for the squared correlations corresponding to nonzero
population correlations is $\sqrt{N}$, while the factor corresponding to zero population
correlation is $N$. See Chapter 13.

In large samples we treat $r_i^2$ as $N[\rho_i^2, (4/N)\rho_i^2(1 - \rho_i^2)^2]$ or $r_i$ as
$N[\rho_i, (1/N)(1 - \rho_i^2)^2]$ (by Theorem 4.2.3) to obtain tests of $\rho_i$ or confidence
intervals for $\rho_i$. Lawley (1959) has shown that the transformation $z_i =
\tanh^{-1}(r_i)$ [see Section 4.2.3] does not stabilize the variance and has a
significant bias in estimating $\zeta_i = \tanh^{-1}(\rho_i)$.
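
A rough interval based on the normal approximation for $r_i$ can be sketched as follows. This is an assumption-laden illustration, not a procedure given in the text: the unknown $\rho_i$ in the asymptotic standard deviation $(1 - \rho_i^2)/\sqrt{N}$ is replaced by $r_i$, the function name `approx_interval` is hypothetical, and the approximation applies only to a nonzero, distinct population correlation.

```python
import math

def approx_interval(r, N, z=1.96):
    """Approximate interval for rho from r ~ N[rho, (1 - rho^2)^2 / N].

    The plug-in standard error (1 - r^2) / sqrt(N) is an assumption of this
    sketch; z = 1.96 corresponds to a nominal 95% interval.
    """
    se = (1.0 - r * r) / math.sqrt(N)
    return r - z * se, r + z * se

# First sample canonical correlation of Section 12.5, with N = 25.
lower, upper = approx_interval(0.788553, N=25)
```

With only 25 observations such an interval should be read as a rough guide, since the result is a large-sample one.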


12.5. AN EXAMPLE 


In this section we consider a simple illustrative example. Rao [(1952), p. 245]
gives some measurements on the first and second adult sons in a sample of 25
families. (These have been used in Problem 3.1 and Problem 4.41.) Let $x_{1\alpha}$
be the head length of the first son in the $\alpha$th family, $x_{2\alpha}$ be the head breadth
of the first son, $x_{3\alpha}$ be the head length of the second son, and $x_{4\alpha}$ be the
head breadth of the second son. We shall investigate the relations between
the measurements for the first son and for the second. Thus $x_\alpha^{(1)\prime} = (x_{1\alpha}, x_{2\alpha})$
and $x_\alpha^{(2)\prime} = (x_{3\alpha}, x_{4\alpha})$. The data can be summarized as*

(1) $\bar x' = (185.72, 151.12, 183.84, 149.24),$

$S = \begin{pmatrix} 95.2933 & 52.8683 & 69.6617 & 46.1117 \\ 52.8683 & 54.3600 & 51.3117 & 35.0533 \\ 69.6617 & 51.3117 & 100.8067 & 56.5400 \\ 46.1117 & 35.0533 & 56.5400 & 45.0233 \end{pmatrix} = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$

The matrix of correlations is

(2) $R = \left(\begin{array}{cc|cc} 1.0000 & 0.7346 & 0.7108 & 0.7040 \\ 0.7346 & 1.0000 & 0.6932 & 0.7086 \\ \hline 0.7108 & 0.6932 & 1.0000 & 0.8392 \\ 0.7040 & 0.7086 & 0.8392 & 1.0000 \end{array}\right) = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}.$

All of the correlations are about 0.7 except for the correlation between the
two measurements on second sons. In particular, $R_{12}$ is nearly of rank one,
and hence the second canonical correlation will be near zero.

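We compute

(3) $R_{22}^{-1}R_{21} = \begin{pmatrix} 0.405769 & 0.333205 \\ 0.363480 & 0.428976 \end{pmatrix},$

(4) $R_{12}R_{22}^{-1}R_{21} = \begin{pmatrix} 0.544311 & 0.538841 \\ 0.538841 & 0.534950 \end{pmatrix}.$

The determinantal equation is

(5) $0 = \begin{vmatrix} 0.544311 - 1.0000\nu & 0.538841 - 0.7346\nu \\ 0.538841 - 0.7346\nu & 0.534950 - 1.0000\nu \end{vmatrix} = 0.460363\nu^2 - 0.287596\nu + 0.000830.$

The roots are 0.621816 and 0.002900; thus $l_1 = 0.788553$ and $l_2 = 0.053852$.
Corresponding to these roots are the vectors

(6) $S_1a^{(1)} = \begin{pmatrix} 0.552166 \\ 0.521548 \end{pmatrix}, \qquad S_1a^{(2)} = \begin{pmatrix} 1.366501 \\ -1.378467 \end{pmatrix},$

where

(7) $S_1 = \begin{pmatrix} 9.7618 & 0 \\ 0 & 7.3729 \end{pmatrix}.$
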

* Rao's computations are in error: his last "difference" is incorrect. 
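We apply $(1/l_i)R_{22}^{-1}R_{21}$ to $S_1a^{(i)}$ to obtain

(8) $S_2c^{(1)} = \begin{pmatrix} 0.504511 \\ 0.538242 \end{pmatrix}, \qquad S_2c^{(2)} = \begin{pmatrix} 1.767281 \\ -1.757288 \end{pmatrix},$

where

(9) $S_2 = \begin{pmatrix} 10.0402 & 0 \\ 0 & 6.7099 \end{pmatrix}.$

We check these computations by calculating

(10) $\dfrac{1}{l_1}R_{11}^{-1}R_{12}(S_2c^{(1)}) = \begin{pmatrix} 0.552157 \\ 0.521560 \end{pmatrix}, \qquad \dfrac{1}{l_2}R_{11}^{-1}R_{12}(S_2c^{(2)}) = \begin{pmatrix} 1.365151 \\ -1.376741 \end{pmatrix}.$
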
The first vector in (10) corresponds closely to the first vector in (6); in fact, it
is a slight improvement, for the computation is equivalent to an iteration on
$S_1a^{(1)}$. The second vector in (10) does not correspond as closely to the second
vector in (6). One reason is that $l_2$ is correct to only four or five significant
figures (as is $\nu_2 = l_2^2$) and thus the components of $S_2c^{(2)}$ can be correct to
only as many significant figures; secondly, the fact that $S_2c^{(2)}$ corresponds
to the smaller root means that the iteration decreases the accuracy instead of
increasing it. Our final results are


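(11) $l_1 = 0.789, \qquad l_2 = 0.054,$

$a^{(1)} = \begin{pmatrix} 0.0566 \\ 0.0707 \end{pmatrix}, \qquad a^{(2)} = \begin{pmatrix} 0.1400 \\ -0.1870 \end{pmatrix},$

$c^{(1)} = \begin{pmatrix} 0.0502 \\ 0.0802 \end{pmatrix}, \qquad c^{(2)} = \begin{pmatrix} 0.1760 \\ -0.2619 \end{pmatrix}.$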


The larger of the two canonical correlations, 0.789, is larger than any of 
the individual correlations of a variable of the first set with a variable of the 
other. The second canonical correlation is very near zero. This means that to 
study the relation between two head dimensions of first sons and second sons 
we can confine our attention to the first canonical variates; the second 
canonical variates are correlated only slightly. The first canonical variate in 
each set is approximately proportional to the sum of the two measurements 
divided by their respective standard deviations; the second canonical variate
in each set is approximately proportional to the difference of the two
standardized measurements.
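
The arithmetic of this example can be retraced from the correlation matrix alone. The following sketch uses only the Python standard library; the helper names `inv2` and `mul2` are introduced here for illustration, and because the input correlations are rounded to four decimals the recomputed roots should agree with the printed ones only to about three or four significant figures.

```python
import math

# Correlation blocks copied from (2); everything else is recomputed.
R11 = [[1.0000, 0.7346], [0.7346, 1.0000]]
R12 = [[0.7108, 0.7040], [0.6932, 0.7086]]
R22 = [[1.0000, 0.8392], [0.8392, 1.0000]]

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mul2(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

R21 = [[R12[j][i] for j in range(2)] for i in range(2)]   # transpose of R12
F = mul2(inv2(R22), R21)                                  # R22^{-1} R21, as in (3)
M = mul2(R12, F)                                          # R12 R22^{-1} R21, as in (4)

# |M - nu R11| = c2 nu^2 + c1 nu + c0, the quadratic in (5).
r = R11[0][1]
c2 = 1.0 - r * r
c1 = r * (M[0][1] + M[1][0]) - (M[0][0] + M[1][1])
c0 = M[0][0] * M[1][1] - M[0][1] * M[1][0]
disc = math.sqrt(c1 * c1 - 4.0 * c2 * c0)
nu1 = (-c1 + disc) / (2.0 * c2)
nu2 = (-c1 - disc) / (2.0 * c2)
l1, l2 = math.sqrt(nu1), math.sqrt(nu2)   # sample canonical correlations
```

The smaller root is sensitive to rounding because the constant term of the quadratic is the difference of two nearly equal products, which is the loss of accuracy discussed after (10).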
12.6. LINEARLY RELATED EXPECTED VALUES


12.6.1. Canonical Analysis of Regression Matrices 


In this section we develop canonical correlations and variates for one 
stochastic vector and one nonstochastic vector. The expected value of the 
stochastic vector is a linear function of the nonstochastic vector (Chapter 8). 
We find new coordinate systems so that the expected value of each coordi- 
nate of the stochastic vector depends on only one coordinate of the non- 
stochastic vector; the coordinates of the stochastic vector are uncorrelated in 
the stochastic sense, and the coordinates of the nonstochastic vector are 
uncorrelated in the sample. The coordinates are ordered according to the 
effect sum of squares relative to the variance. The algebra is similar to that
developed in Section 12.2. 

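If $X$ has the normal distribution $N(\mu, \Sigma)$ with $X$, $\mu$, and $\Sigma$ partitioned as
in (1) and (2) of Section 12.2 and $\mu = (\mu^{(1)\prime}, \mu^{(2)\prime})'$, the conditional
distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is normal with mean

(1) $\mu^{(1)} + B(x^{(2)} - \mu^{(2)}), \qquad B = \Sigma_{12}\Sigma_{22}^{-1},$

and covariance matrix

(2) $\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

Since we consider a set of random vectors $X_1^{(1)}, \ldots, X_N^{(1)}$ with expected values
depending on $x_1^{(2)}, \ldots, x_N^{(2)}$ (nonstochastic), we can write the conditional
expected value of $X_\phi^{(1)}$ as $\tau + B(x_\phi^{(2)} - \bar x^{(2)})$, where $\tau = \mu^{(1)} + B(\bar x^{(2)} - \mu^{(2)})$
can be considered as a parameter vector. This is the model of Section 8.2
with a slight change of notation.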

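The model of this section is

(3) $\mathcal{E} X_\phi^{(1)} = \tau + B(x_\phi^{(2)} - \bar x^{(2)}), \qquad \phi = 1, \ldots, N,$

where $x_1^{(2)}, \ldots, x_N^{(2)}$ are a set of nonstochastic vectors ($q \times 1$) and $\bar x^{(2)} =
N^{-1}\sum_{\phi=1}^{N} x_\phi^{(2)}$. The covariance matrix is

(4) $\mathcal{E}(X_\phi^{(1)} - \mathcal{E}X_\phi^{(1)})(X_\phi^{(1)} - \mathcal{E}X_\phi^{(1)})' = \Psi.$

Consider a linear combination of $X_\phi^{(1)}$, say $U_\phi = \alpha'X_\phi^{(1)}$. Then $U_\phi$ has
variance $\alpha'\Psi\alpha$ and expected value

(5) $\mathcal{E}U_\phi = \alpha'\tau + \alpha'B(x_\phi^{(2)} - \bar x^{(2)}).$

The mean expected value is $(1/N)\sum_{\phi=1}^{N} \mathcal{E}U_\phi = \alpha'\tau$, and the mean sum of
squares due to $x^{(2)}$ is

(6) $\dfrac{1}{N}\sum_{\phi=1}^{N} (\mathcal{E}U_\phi - \alpha'\tau)^2 = \dfrac{1}{N}\sum_{\phi=1}^{N} \alpha'B(x_\phi^{(2)} - \bar x^{(2)})(x_\phi^{(2)} - \bar x^{(2)})'B'\alpha = \alpha'BS_{22}B'\alpha.$

We can ask for the linear combination that maximizes the mean sum of
squares relative to its variance; that is, the linear combination of dependent
variables on which the independent variables have greatest effect. We want
to maximize (6) subject to $\alpha'\Psi\alpha = 1$. That leads to the vector equation

(7) $(BS_{22}B' - \kappa\Psi)\alpha = 0$

for $\kappa$ satisfying

(8) $|BS_{22}B' - \kappa\Psi| = 0.$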


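Multiplication of (7) on the left by $\alpha'$ shows that $\alpha'BS_{22}B'\alpha = \kappa$ for $\alpha$ and
$\kappa$ satisfying $\alpha'\Psi\alpha = 1$ and (7); to obtain the maximum we take the largest
root of (8), say $\kappa_1$. Denote this vector by $\alpha^{(1)}$, and the corresponding random
variable by $U_{1\phi} = \alpha^{(1)\prime}X_\phi^{(1)}$. The expected value of this first canonical
variable is $\mathcal{E}U_{1\phi} = \alpha^{(1)\prime}[B(x_\phi^{(2)} - \bar x^{(2)}) + \tau]$. Let $\alpha^{(1)\prime}B = k\gamma^{(1)\prime}$, where $k$ is
determined so

(9) $1 = \dfrac{1}{N}\sum_{\phi=1}^{N}\left[\gamma^{(1)\prime}x_\phi^{(2)} - \dfrac{1}{N}\sum_{\eta=1}^{N}\gamma^{(1)\prime}x_\eta^{(2)}\right]^2 = \dfrac{1}{N}\sum_{\phi=1}^{N}\gamma^{(1)\prime}(x_\phi^{(2)} - \bar x^{(2)})(x_\phi^{(2)} - \bar x^{(2)})'\gamma^{(1)} = \gamma^{(1)\prime}S_{22}\gamma^{(1)}.$

Then $k = \sqrt{\kappa_1}$. Let $v_\phi^{(1)} = \gamma^{(1)\prime}(x_\phi^{(2)} - \bar x^{(2)})$. Then $\mathcal{E}U_{1\phi} = \sqrt{\kappa_1}\,v_\phi^{(1)} + \alpha^{(1)\prime}\tau$.

Next let us obtain a linear combination $U_\phi = \alpha'X_\phi^{(1)}$ that has maximum
effect sum of squares among all linear combinations with variance 1 and
uncorrelated with $U_{1\phi}$, that is, $0 = \mathcal{E}(U_\phi - \mathcal{E}U_\phi)(U_{1\phi} - \mathcal{E}U_{1\phi})' = \alpha'\Psi\alpha^{(1)}$.
As in Section 12.2, we can set up this maximization problem with Lagrange
multipliers and find that $\alpha$ satisfies (7) for some $\kappa$ satisfying (8) and
$\alpha'\Psi\alpha = 1$. The process is continued in a manner similar to that in Section
12.2. We summarize the results.

The jth canonical random variable is $U_{j\phi} = \alpha^{(j)\prime}X_\phi^{(1)}$, where $\alpha^{(j)}$ satisfies
(7) for $\kappa = \kappa_j$ and $\alpha^{(j)\prime}\Psi\alpha^{(j)} = 1$; $\kappa_1 \ge \kappa_2 \ge \cdots \ge \kappa_{p_1}$ are the roots of (8).
We shall assume the rank of $B$ is $p_1 \le p_2$. (Then $\kappa_{p_1} > 0$.) $U_{j\phi}$ has the largest
effect sum of squares of linear combinations that have unit variance and are
uncorrelated with $U_{1\phi}, \ldots, U_{j-1,\phi}$. Let $\gamma^{(j)} = (1/\sqrt{\kappa_j})B'\alpha^{(j)}$, $\nu_j = \alpha^{(j)\prime}\tau$, and
$v_\phi^{(j)} = \gamma^{(j)\prime}(x_\phi^{(2)} - \bar x^{(2)})$. Then

(10) $\mathcal{E}U_{j\phi} = \sqrt{\kappa_j}\,v_\phi^{(j)} + \nu_j,$

(11) $\dfrac{1}{N}\sum_{\phi=1}^{N}\left[v_\phi^{(j)} - \dfrac{1}{N}\sum_{\eta=1}^{N}v_\eta^{(j)}\right]^2 = 1,$

(12) $\dfrac{1}{N}\sum_{\phi=1}^{N}\left[v_\phi^{(i)} - \dfrac{1}{N}\sum_{\eta=1}^{N}v_\eta^{(i)}\right]\left[v_\phi^{(j)} - \dfrac{1}{N}\sum_{\eta=1}^{N}v_\eta^{(j)}\right] = 0, \qquad i \ne j.$

If $p_2 > p_1$, then $\gamma^{(p_1+1)}, \ldots, \gamma^{(p_2)}$ can be chosen so $v_\phi^{(p_1+1)} = \gamma^{(p_1+1)\prime}(x_\phi^{(2)} - \bar x^{(2)}),
\ldots, v_\phi^{(p_2)} = \gamma^{(p_2)\prime}(x_\phi^{(2)} - \bar x^{(2)})$ satisfy (11) and (12).

Let $A = (\alpha^{(1)}, \ldots, \alpha^{(p_1)})$, $\Gamma_1 = (\gamma^{(1)}, \ldots, \gamma^{(p_1)})$, $\Gamma_2 = (\gamma^{(p_1+1)}, \ldots, \gamma^{(p_2)})$,
$\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_{p_1}) = \mathrm{diag}(\sqrt{\kappa_1}, \ldots, \sqrt{\kappa_{p_1}})$, $U_\phi = A'X_\phi^{(1)}$,
$v_\phi^{(1)} = \Gamma_1'(x_\phi^{(2)} - \bar x^{(2)})$, $v_\phi^{(2)} = \Gamma_2'(x_\phi^{(2)} - \bar x^{(2)})$, $v_\phi = (v_\phi^{(1)\prime}, v_\phi^{(2)\prime})'$,
and $\nu = (\nu_1, \ldots, \nu_{p_1})'$, $\phi = 1, \ldots, N$. Then

(13) $\mathcal{E}(U_\phi - \mathcal{E}U_\phi)(U_\phi - \mathcal{E}U_\phi)' = A'\Psi A = I,$

(14) $\mathcal{E}U_\phi = \Delta v_\phi^{(1)} + \nu, \qquad \phi = 1, \ldots, N,$

(15) $\dfrac{1}{N}\sum_{\phi=1}^{N}\left(v_\phi - \dfrac{1}{N}\sum_{\eta=1}^{N}v_\eta\right)\left(v_\phi - \dfrac{1}{N}\sum_{\eta=1}^{N}v_\eta\right)' = I.$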


The random canonical variates are uncorrelated and have variance 1. The 
expected value of each random canonical variate is a multiple of the corre- 
sponding nonstochastic canonical variate plus a constant. The nonstochastic 
canonical variates have sample variance 1 and are uncorrelated in the 
sample. 

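If $p_1 > p_2$, the maximum rank of $B$ is $p_2$ and $\kappa_{p_2+1} = \cdots = \kappa_{p_1} = 0$. In that
case we define $A_1 = (\alpha^{(1)}, \ldots, \alpha^{(p_2)})$ and $A_2 = (\alpha^{(p_2+1)}, \ldots, \alpha^{(p_1)})$, where
$\alpha^{(1)}, \ldots, \alpha^{(p_2)}$ (corresponding to positive $\kappa$'s) are defined as before and
$\alpha^{(p_2+1)}, \ldots, \alpha^{(p_1)}$ are any vectors satisfying $\alpha^{(j)\prime}\Psi\alpha^{(j)} = 1$ and $\alpha^{(j)\prime}\Psi\alpha^{(i)} =
0$, $i \ne j$. Then $\mathcal{E}U_\phi^{(i)} = \delta_i v_\phi^{(i)} + \nu_i$, $i = 1, \ldots, p_2$, and $\mathcal{E}U_\phi^{(i)} = \nu_i$, $i = p_2 + 1, \ldots, p_1$.

In either case if the rank of $B$ is $r \le \min(p_1, p_2)$, there are $r$ roots of (8)
that are nonzero and hence $\mathcal{E}U_\phi^{(i)} = \delta_i v_\phi^{(i)} + \nu_i$ for $i = 1, \ldots, r$.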
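12.6.2. Estimation


Let $x_1^{(1)}, \ldots, x_N^{(1)}$ be a set of observations on $X_1^{(1)}, \ldots, X_N^{(1)}$ with the probability
structure developed in Section 12.6.1, and let $x_1^{(2)}, \ldots, x_N^{(2)}$ be the set of
corresponding independent variates. Then we can estimate $\tau$, $B$, and $\Psi$ by

(16) $\hat\tau = \dfrac{1}{N}\sum_{\phi=1}^{N} x_\phi^{(1)} = \bar x^{(1)},$

(17) $\hat B = A_{12}A_{22}^{-1} = S_{12}S_{22}^{-1},$

(18) $\hat\Psi = \dfrac{1}{n}\sum_{\phi=1}^{N}\left[x_\phi^{(1)} - \bar x^{(1)} - \hat B(x_\phi^{(2)} - \bar x^{(2)})\right]\left[x_\phi^{(1)} - \bar x^{(1)} - \hat B(x_\phi^{(2)} - \bar x^{(2)})\right]'$

$\qquad = \dfrac{1}{n}\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right) = S_{11} - S_{12}S_{22}^{-1}S_{21},$

where the $A$'s and $S$'s are defined as before. (It is convenient to divide by
$n = N - 1$ instead of by $N$; the latter would yield maximum likelihood
estimators.)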

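The sample analogs of (7) and (8) are

(19) $0 = (\hat BS_{22}\hat B' - k\hat\Psi)\tilde a = \left[S_{12}S_{22}^{-1}S_{21} - k(S_{11} - S_{12}S_{22}^{-1}S_{21})\right]\tilde a,$

(20) $0 = |\hat BS_{22}\hat B' - k\hat\Psi| = \left|S_{12}S_{22}^{-1}S_{21} - k(S_{11} - S_{12}S_{22}^{-1}S_{21})\right|.$

The roots $k_1 \ge \cdots \ge k_{p_1}$ of (20) estimate the roots $\kappa_1 \ge \cdots \ge \kappa_{p_1}$ of (8), and
the corresponding solutions $\tilde a^{(1)}, \ldots, \tilde a^{(p_1)}$ of (19), normalized by $\tilde a'\hat\Psi\tilde a = 1$,
estimate $\alpha^{(1)}, \ldots, \alpha^{(p_1)}$. Then $\tilde c^{(j)} = (1/\sqrt{k_j})\hat B'\tilde a^{(j)}$ estimates $\gamma^{(j)}$, and $n_j =
\tilde a^{(j)\prime}\bar x^{(1)}$ estimates $\nu_j$. The sample canonical variates are $\tilde a^{(j)\prime}x_\phi^{(1)}$ and
$\tilde c^{(j)\prime}(x_\phi^{(2)} - \bar x^{(2)})$, $j = 1, \ldots, p_1$, $\phi = 1, \ldots, N$. If $p_1 > p_2$, then $p_1 - p_2$ more $\tilde a^{(j)}$ can
be defined satisfying $\tilde a^{(j)\prime}\hat\Psi\tilde a^{(j)} = 1$ and $\tilde a^{(j)\prime}\hat\Psi\tilde a^{(i)} = 0$, $i \ne j$.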


12.6.3. Relations Between Canonical Variates 


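In Section 12.3, the roots $l_1 \ge \cdots \ge l_{p_1}$ were defined to satisfy

(21) $0 = \begin{vmatrix} -lS_{11} & S_{12} \\ S_{21} & -lS_{22} \end{vmatrix} = (-1)^{p_2}\,l^{\,p_2-p_1}\,|S_{22}|\,\left|S_{12}S_{22}^{-1}S_{21} - l^2S_{11}\right|.$

Since (20) can be written

(22) $0 = \left|(1 + k)S_{12}S_{22}^{-1}S_{21} - kS_{11}\right|,$

we see that $l_i^2 = k_i/(1 + k_i)$ and $k_i = l_i^2/(1 - l_i^2)$, $i = 1, \ldots, p_1$. The vector $a^{(i)}$
in Section 12.3 satisfies

(23) $0 = \left(S_{12}S_{22}^{-1}S_{21} - l_i^2S_{11}\right)a^{(i)}$

$\qquad = \left(S_{12}S_{22}^{-1}S_{21} - \dfrac{k_i}{1 + k_i}S_{11}\right)a^{(i)}$

$\qquad = \dfrac{1}{1 + k_i}\left[S_{12}S_{22}^{-1}S_{21} - k_i\left(S_{11} - S_{12}S_{22}^{-1}S_{21}\right)\right]a^{(i)},$

which is equivalent to (19) for $k = k_i$. Comparison of the normalizations
$a^{(i)\prime}S_{11}a^{(i)} = 1$ and $\tilde a^{(i)\prime}(S_{11} - S_{12}S_{22}^{-1}S_{21})\tilde a^{(i)} = 1$ shows that $\tilde a^{(i)} =
(1/\sqrt{1 - l_i^2})\,a^{(i)}$. Then $\tilde c^{(i)} = (1/\sqrt{k_i})\,S_{22}^{-1}S_{21}\tilde a^{(i)} = c^{(i)}$.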

We see that canonical variable analysis can be applied when the two 
vectors are jointly random and when one vector is random and the other is 
nonstochastic. The canonical variables defined by the two approaches are the 
same except for normalization. The measure of relationship between
corresponding canonical variables can be the (canonical) correlation or it can be
the ratio of “explained” to “unexplained” variance. 
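
The correspondence $k_i = l_i^2/(1 - l_i^2)$ can be checked numerically. The sketch below is only an illustration: it reuses the sample covariance blocks of Section 12.5 as input and obtains both sets of roots from general eigenvalue routines in NumPy rather than from the hand methods of the text.

```python
import numpy as np

# Sample covariance blocks from Section 12.5 (illustrative input).
S11 = np.array([[95.2933, 52.8683], [52.8683, 54.3600]])
S12 = np.array([[69.6617, 46.1117], [51.3117, 35.0533]])
S22 = np.array([[100.8067, 56.5400], [56.5400, 45.0233]])

M = S12 @ np.linalg.solve(S22, S12.T)          # S12 S22^{-1} S21

# Squared canonical correlations: roots of |M - l^2 S11| = 0.
l2 = np.sort(np.linalg.eigvals(np.linalg.solve(S11, M)).real)[::-1]

# Effect-to-error roots: roots of |M - k (S11 - M)| = 0.
k = np.sort(np.linalg.eigvals(np.linalg.solve(S11 - M, M)).real)[::-1]

ratio = l2 / (1.0 - l2)                        # should reproduce k
```

Here `k` is the ratio of "explained" to "unexplained" variance for each pair of canonical variates, and `l2` is the squared canonical correlation for the same pair.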


12.6.4. Testing Rank 


The number of roots $\kappa_i$ that are different from 0 is the rank of the regression
matrix $B$. It is the number of linear combinations of the regression variables
that are needed to express the expected values of $X_\phi^{(1)}$. We can ask whether
the rank is $k$ ($1 \le k \le p_1$ if $p_1 \le p_2$) against the alternative that the rank is
greater than $k$. The hypothesis is

(24) $H_k\colon \kappa_{k+1} = \cdots = \kappa_{p_1} = 0.$

The likelihood ratio criterion [Anderson (1951b)] is a power of

(25) $\prod_{i=k+1}^{p_1} (1 + k_i)^{-1} = \prod_{i=k+1}^{p_1} (1 - l_i^2).$

Note that this is the same criterion as for the case of both vectors stochastic
(Section 12.4). Then

(26) $-\left[N - \tfrac{1}{2}(p + 3)\right]\sum_{i=k+1}^{p_1} \log(1 - l_i^2)$

has approximately the $\chi^2$-distribution with $(p_1 - k)(p_2 - k)$ degrees of
freedom.

The determination of the rank as any number between 0 and $p_1$ can be
done as in Section 12.4.


12.6.5. Linear Functional Relationships 


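The study of Section 12.6 can be carried out in other terms. For example, the
balanced one-way analysis of variance can be set up as

(27) $Y_{\alpha j} = \nu_\alpha + \mu + U_{\alpha j}, \qquad \alpha = 1, \ldots, m, \quad j = 1, \ldots, l,$

where $\mathcal{E}U_{\alpha j} = 0$, $\mathcal{E}U_{\alpha j}U_{\alpha j}' = \Psi$, $\sum_{\alpha=1}^{m} \nu_\alpha = 0$, and

(28) $\Theta\nu_\alpha = 0, \qquad \alpha = 1, \ldots, m,$

where $\Theta$ is $q \times p_1$ of rank $q$ ($\le p_1$). This is a special case of the model of
Section 12.6.1 with $\phi = 1, \ldots, N$ replaced by the pair of indices $(\alpha, j)$,
$X_\phi^{(1)} = Y_{\alpha j}$, $\tau = \mu$, and $B(x_\phi^{(2)} - \bar x^{(2)}) = \nu_\alpha$ by use of dummy variables as in
Section 8.8. The rank of $(\nu_1, \ldots, \nu_m)$ is that of $B$, namely, $r = p_1 - q$. There
are $q$ roots of (8) equal to 0 with

(29) $BS_{22}B' = \dfrac{1}{m}\sum_{\alpha=1}^{m} \nu_\alpha\nu_\alpha'.$

The model (27) can be interpreted as repeated observations on $\nu_\alpha + \mu$ with
error. The component equations of (28) are the linear functional relationships.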

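Let $\bar y_\alpha = (1/l)\sum_{j=1}^{l} y_{\alpha j}$ and $\bar y = (1/m)\sum_{\alpha=1}^{m} \bar y_\alpha$. The sum of squares for effect
is

(30) $H = l\sum_{\alpha=1}^{m} (\bar y_\alpha - \bar y)(\bar y_\alpha - \bar y)' = n\hat BS_{22}\hat B'$

with $m - 1$ degrees of freedom, and the sum of squares for error is

(31) $G = \sum_{\alpha=1}^{m}\sum_{j=1}^{l} (y_{\alpha j} - \bar y_\alpha)(y_{\alpha j} - \bar y_\alpha)' = n\hat\Psi$
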

with т(- 1) degrees of freedom. The case p, <p, corresponds to pi «I. 
Then a maximum likelihood estimator of $\Theta$ is


(32) $\hat\Theta = (a^{(r+1)}, \ldots, a^{(p_1)})',$
and the maximum likelihood estimators of $\nu_\alpha$ are

(33) $,- ¥6'O(5, — y»), а= 1,...,п. 
The estimator (32) can be multiplied by any nonsingular $q \times q$ matrix on the
left to obtain another. For a fuller discussion, see Anderson (1984a) and
Kendall and Stuart (1973).


12.7. REDUCED RANK REGRESSION 


Reduced rank regression involves estimating the regression matrix $B$ in
$\mathcal{E}(X^{(1)} \mid X^{(2)}) = BX^{(2)}$ by a matrix $\hat B_k$ of preassigned rank $k$. In the
limited-information maximum likelihood method of estimating an equation that is part of
a system of simultaneous equations (Section 12.8), the regression matrix is
assumed to be of rank one less than the order of the matrix. Anderson
(1951a) derived the maximum likelihood estimator of $B$ when the model is

(1) $X_\alpha^{(1)} = \tau + B(x_\alpha^{(2)} - \bar x^{(2)}) + Z_\alpha, \qquad \alpha = 1, \ldots, N,$

the rank of $B$ is specified to be $k$ ($\le p_1$), the vectors $x_1^{(2)}, \ldots, x_N^{(2)}$ are
nonstochastic, and $Z_\alpha$ is normally distributed. On the basis of a sample
$x_1, \ldots, x_N$, define $\hat\Sigma$ by (2) of Section 12.3 and $\hat A$, $\hat\Gamma$, and $\hat\Lambda$ by (3), (4), and
(5). Partition $\hat\Lambda = \mathrm{diag}(\hat\Lambda_1, \hat\Lambda_2)$, $\hat A = (\hat A_1, \hat A_2)$, and $\hat\Gamma = (\hat\Gamma_1, \hat\Gamma_2)$, where $\hat A_1$,
$\hat\Lambda_1$, and $\hat\Gamma_1$ have $k$ columns. Let $\hat\Phi_1 = \hat A_1(I_k - \hat\Lambda_1^2)^{-1/2}$.

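Definition 12.7.1 (Reduced Rank Regression) The reduced rank regression
estimator in (1) is

(2) $\hat B_k = \hat\Sigma_{11}\hat A_1\hat\Lambda_1\hat\Gamma_1' = \hat\Sigma_{11}\hat A_1\hat A_1'\hat B = \hat\Sigma_{11\cdot 2}\hat\Phi_1\hat\Phi_1'\hat B,$

where $\hat B = \hat\Sigma_{12}\hat\Sigma_{22}^{-1}$ and $\hat\Sigma_{11\cdot 2} = \hat\Sigma_{11} - \hat B\hat\Sigma_{22}\hat B'$.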
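
A minimal NumPy sketch of the estimator (2) follows. It is an illustration under stated assumptions rather than a reference implementation: the function name `reduced_rank_regression` is hypothetical, the data are simulated with a rank-one regression matrix chosen here, and the canonical vectors are obtained from a symmetric eigenproblem in $\hat\Sigma_{11}^{-1/2}$ instead of the determinantal equation (3) of Section 12.3.

```python
import numpy as np

def reduced_rank_regression(X1, X2, k):
    """Rank-k estimator (2) from N x p1 and N x p2 data (rows = observations)."""
    N = X1.shape[0]
    Z1 = X1 - X1.mean(axis=0)
    Z2 = X2 - X2.mean(axis=0)
    S11, S12, S22 = Z1.T @ Z1 / N, Z1.T @ Z2 / N, Z2.T @ Z2 / N   # blocks of Sigma-hat
    B_hat = np.linalg.solve(S22, S12.T).T          # S12 S22^{-1}, using symmetry of S22
    # Solve S12 S22^{-1} S21 a = l^2 S11 a through S11^{-1/2}.
    w, V = np.linalg.eigh(S11)
    S11_mhalf = V @ np.diag(w ** -0.5) @ V.T
    l2, E = np.linalg.eigh(S11_mhalf @ (B_hat @ S12.T) @ S11_mhalf)
    top = np.argsort(l2)[::-1][:k]                 # k largest squared correlations
    A1 = S11_mhalf @ E[:, top]                     # satisfies A1' S11 A1 = I_k
    B_k = S11 @ A1 @ A1.T @ B_hat                  # middle expression in (2)
    return B_k, B_hat, A1, l2[top], (S11, S12, S22)

# Simulated data with a rank-one regression matrix (illustrative values).
rng = np.random.default_rng(0)
X2 = rng.normal(size=(300, 4))
B_true = np.outer([1.0, -2.0, 0.5], [1.0, 0.0, 1.0, -1.0])
X1 = X2 @ B_true.T + rng.normal(size=(300, 3))
B_1, B_hat, A1, l2, (S11, S12, S22) = reduced_rank_regression(X1, X2, k=1)
```

Taking $k = p_1$ returns the unrestricted estimator $\hat B$, since $\hat A\hat A' = \hat\Sigma_{11}^{-1}$ when all $p_1$ canonical vectors are used.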


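The maximum likelihood estimator of $B$ of rank $k$ is the same for $X^{(1)}$ and
$X^{(2)}$ normally distributed because the density of $X = (X^{(1)\prime}, X^{(2)\prime})'$ factors as

(3) $n(x \mid \mu, \Sigma) = n\left(x^{(1)} \mid \mu^{(1)} + B(x^{(2)} - \mu^{(2)}), \Sigma_{11\cdot 2}\right)\, n\left(x^{(2)} \mid \mu^{(2)}, \Sigma_{22}\right).$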


Reduced rank regression has been applied in many disciplines, including 
econometrics, time series analysis, and signal processing. See, for example, 
Johansen (1995) for use of reduced rank regression in estimation of cointe- 
gration in economic time series, Tsay and Tiao (1985) and Ahn and Reinsel 
(1988) for applications in stationary processes, and Stoica and Viberg (1996)
for utilization in signal processing. In general the estimated reduced rank
regression is a better estimator in a regression model than the unrestricted
estimator.

In Section 13.7 the asymptotic distribution of the reduced rank regression
estimator is obtained under the assumptions that are sufficient for the
asymptotic normality of the least squares estimator $\hat B = \hat\Sigma_{12}\hat\Sigma_{22}^{-1}$. The
asymptotic distribution of $\hat B_k$ has been obtained by Ryan, Hubert, Carter, Sprague,
and Parrott (1992), Schmidli (1996), Stoica and Viberg (1996), and Reinsel
and Velu (1998) by use of the expected Fisher information on the assumption
that $Z_\alpha$ is normally distributed. Izenman (1975) suggested the term reduced
rank regression.


12.8. SIMULTANEOUS EQUATIONS MODELS 


12.8.1. The Model 


Inference for structural equation models in econometrics is related to canonical
correlations. The general model is

(1) $By_t + \Gamma z_t = u_t, \qquad t = 1, \ldots, T,$

where $B$ is $G \times G$ and $\Gamma$ is $G \times K$. Here $y_t$ is composed of $G$ jointly
dependent variables (endogenous), $z_t$ is composed of $K$ predetermined
variables (exogenous and lagged dependent) which are treated as "independent"
variables, and $u_t$ consists of $G$ unobservable random variables with

(2) $\mathcal{E}u_t = 0, \qquad \mathcal{E}u_tu_t' = \Sigma.$

We require $B$ to be nonsingular. This model was initiated by Haavelmo
(1944) and was developed by Koopmans, Marschak, Hurwicz, Anderson,
Rubin, Leipnik, et al., 1944-1954, at the Cowles Commission for Research in
Economics. Each component equation represents the behavior of some group
(such as consumers or producers) and has economic meaning.

The set of structural equations (1) can be solved for $y_t$ (because $B$ is
nonsingular):

(3) $y_t = \Pi z_t + v_t,$

where

(4) $\Pi = -B^{-1}\Gamma, \qquad v_t = B^{-1}u_t,$

with

(5) $\mathcal{E}v_t = 0, \qquad \mathcal{E}v_tv_t' = B^{-1}\Sigma(B')^{-1} = \Omega,$

say. The equation (3) is called the reduced form of the model. It is a
multivariate regression model. In principle, it is observable.


12.8.2. Identification by Specified Zeros 


The structural equation (1) can be multiplied on the left by an arbitrary 
nonsingular matrix. To determine component equations that are economi- 
cally meaningful, restrictions must be imposed. For example, in the case of 
demand and supply the equation describing demand may be distinguished by 
the fact that it includes consumer income and excludes cost of raw materials, 
which is in the supply equation. The exclusion of the latter amounts to
specifying that its coefficient in the demand equation is 0. 

We consider identification of a structural equation by specifying certain
coefficients to be 0. It is convenient to treat the first equation. Suppose the
variables are numbered so that the first $G_1$ jointly dependent variables are
included in the first equation and the remaining $G_2 = G - G_1$ are not and
the first $K_1$ predetermined variables are included and $K_2 = K - K_1$ are
excluded. Then we can partition the coefficient matrices as

(6) $(B \;\; \Gamma) = \begin{pmatrix} \beta' & 0' & \gamma' & 0' \\ \ast & \ast & \ast & \ast \end{pmatrix},$

where the vectors $\beta$, $0$, $\gamma$, and $0$ have $G_1$, $G_2$, $K_1$, and $K_2$ components,
respectively, and $\ast$ stands for the coefficients of the other equations. The
reduced form is partitioned conformally into $G_1$ and $G_2$
sets of rows and $K_1$ and $K_2$ sets of columns:

(7) $\Pi = \begin{pmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{pmatrix}.$

The relation between $B$, $\Gamma$, and $\Pi$ can be expressed as

(8) $\begin{pmatrix} \gamma' & 0' \\ \ast & \ast \end{pmatrix} = \Gamma = -B\Pi = -\begin{pmatrix} \beta' & 0' \\ \ast & \ast \end{pmatrix}\begin{pmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{pmatrix} = -\begin{pmatrix} \beta'\Pi_{11} & \beta'\Pi_{12} \\ \ast & \ast \end{pmatrix}.$

The upper right-hand corner of (8) yields

(9) $\beta'\Pi_{12} = 0.$

To determine $\beta$ ($G_1 \times 1$) uniquely except for a constant of proportionality we
need

(10) $\mathrm{rank}(\Pi_{12}) = G_1 - 1.$

This implies

(11) $K_2 \ge G_1 - 1.$

Addition of $G_2$ to (11) gives the order condition

(12) $G_2 + K_2 \ge G_1 + G_2 - 1 = G - 1.$

The number of specified 0's in an identified equation must be at least equal
to 1 less than the number of equations (or jointly dependent variables).

It can be shown that when $B$ is nonsingular (10) holds if and only if the
rank of the matrix consisting of the columns of $(B \;\; \Gamma)$ with specified 0's in the
first row is $G - 1$.


12.8.3. Estimation of the Reduced Form


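The model (3) is a typical multivariate regression model. The observations
are

(13) $\begin{pmatrix} y_1 \\ z_1 \end{pmatrix}, \ldots, \begin{pmatrix} y_T \\ z_T \end{pmatrix}.$

The usual estimators of $\Pi$ and $\Omega$ (Section 8.2) are

(14) $P = \sum_{t=1}^{T} y_tz_t' \left(\sum_{t=1}^{T} z_tz_t'\right)^{-1},$

(15) $\hat\Omega = \dfrac{1}{T}\sum_{t=1}^{T} (y_t - Pz_t)(y_t - Pz_t)'.$

These are maximum likelihood estimators if the $v_t$ are normal.
If the $z_t$ are exogenous (regardless of normality), then

(16) $\mathcal{E}\,\mathrm{vec}\,P = \mathrm{vec}\,\Pi, \qquad \mathcal{C}(\mathrm{vec}\,P) = A^{-1} \otimes \Omega,$

where

(17) $A = \sum_{t=1}^{T} z_tz_t'$

and $\mathrm{vec}(d_1, \ldots, d_m) = (d_1', \ldots, d_m')'$. If, furthermore, the $v_t$ are normal, then
$P$ is normal and $T\hat\Omega$ has the Wishart distribution with covariance matrix $\Omega$
and $T - K$ degrees of freedom.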
12.8.4. Estimation of the Coefficients of an Equation


First, consider the estimation of the vector of coefficients $\beta$ when $K_2 =
G_1 - 1$. Let

(18) $P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$

be partitioned as $\Pi$. Then the probability is 1 that $\mathrm{rank}(P_{12}) = G_1 - 1$ and
the equation

(19) $\hat\beta'P_{12} = 0$

has a nontrivial solution that is unique except for a constant of proportionality.
This is the maximum likelihood estimator when the disturbance terms are
normal.

If $K_2 \ge G_1$, then the probability is 1 that $\mathrm{rank}(P_{12}) = G_1$ and (19) has only
the trivial solution $\hat\beta = 0$, which is unsatisfactory. To obtain a suitable
estimator we find $\hat\beta$ to minimize $\hat\beta'P_{12}$ in some sense relative to another
function of $\hat\beta$.

Let $z_t$ be partitioned into subvectors of $K_1$ and $K_2$ components:

(20) $z_t = \begin{pmatrix} z_t^{(1)} \\ z_t^{(2)} \end{pmatrix}.$

Let

(21) $\sum_{t=1}^{T} z_tz_t' = A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$

(22) $A_{22\cdot 1} = A_{22} - A_{21}A_{11}^{-1}A_{12}.$

Let $y_t$ and $\Omega$ be partitioned into $G_1$ and $G_2$ components:

(23) $y_t = \begin{pmatrix} y_t^{(1)} \\ y_t^{(2)} \end{pmatrix},$

(24) $\Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}.$
Now set up the multivariate analysis of variance table for $y_t^{(1)}$:

Source                        Sum of Squares

$z^{(1)}$                     $\sum_{s,t=1}^{T} y_t^{(1)}z_t^{(1)\prime}A_{11}^{-1}z_s^{(1)}y_s^{(1)\prime}$

$z^{(2)}$ beyond $z^{(1)}$    $P_{12}A_{22\cdot 1}P_{12}'$

Error                         $\sum_{t=1}^{T} (y_t^{(1)} - P_{11}z_t^{(1)} - P_{12}z_t^{(2)})(y_t^{(1)} - P_{11}z_t^{(1)} - P_{12}z_t^{(2)})'$

Total                         $\sum_{t=1}^{T} y_t^{(1)}y_t^{(1)\prime}$

The first term in the table is the (vector) sum of squares of $y_t^{(1)}$ due to the
effect of $z_t^{(1)}$. The second term is due to the effect of $z_t^{(2)}$ beyond the effect of
$z_t^{(1)}$. The two add to $(PAP')_{11}$, which is the total effect of $z_t$, the predetermined
variables.

We propose to find the vector $\beta$ such that the effect of $z_t^{(2)}$ on $\beta' y_t^{(1)}$ beyond the effect of $z_t^{(1)}$ is minimized relative to the error sum of squares of $\beta' y_t^{(1)}$. We minimize

(25) $\frac{\beta' (P_{12} S_{22\cdot 1} P_{12}') \beta}{\beta' \hat\Omega_{11} \beta} = \frac{(\beta' P_{12}) S_{22\cdot 1} (\beta' P_{12})'}{\beta' \hat\Omega_{11} \beta}$,

where $T\hat\Omega = \sum_{t=1}^{T} y_t y_t' - PAP'$. This estimator has been called the least variance ratio estimator. Under normality and based only on the 0 restrictions on the coefficients of this single equation, the estimator is maximum likelihood and is known as the limited-information maximum likelihood (LIML) estimator [Anderson and Rubin (1949)].

The algebra of minimizing (25) is to find the smallest root, say $\nu$, of

(26) $|P_{12} S_{22\cdot 1} P_{12}' - \nu \hat\Omega_{11}| = 0$

and the corresponding vector satisfying

(27) $P_{12} S_{22\cdot 1} P_{12}' \hat\beta = \nu \hat\Omega_{11} \hat\beta$.


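The vector is normalized according to some rule. A frequently used rule is to set one (nonzero) coefficient equal to 1, say the first, $\hat\beta_1 = 1$. If we write

(28) $\beta = \begin{pmatrix} 1 \\ \beta^* \end{pmatrix}$, $\hat\beta = \begin{pmatrix} 1 \\ \hat\beta^* \end{pmatrix}$,

(29) $\Pi_{12} = \begin{pmatrix} \pi_{12}' \\ \Pi_{12}^* \end{pmatrix}$, $P_{12} = \begin{pmatrix} p_{12}' \\ P_{12}^* \end{pmatrix}$,

(30) $\hat\Omega_{11} = \begin{pmatrix} \hat\omega_{11} & \hat\omega_{(1)}' \\ \hat\omega_{(1)} & \hat\Omega_{11}^* \end{pmatrix}$,

then (27) can be replaced by the linear equation

(31) $(P_{12}^* S_{22\cdot 1} P_{12}^{*\prime} - \nu \hat\Omega_{11}^*) \hat\beta^* = -(P_{12}^* S_{22\cdot 1} p_{12} - \nu \hat\omega_{(1)})$.

The first component equation in (27) has been dropped because it is linearly dependent on the other equations [because $\nu$ is a root of (26)].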
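When $G_1 = 2$ the determinantal equation (26) is a quadratic in $\nu$, so the smallest root and the normalized vector from (31) can be written down directly. The sketch below assumes that two symmetric $2 \times 2$ matrices are already available: `M`, standing for the effect matrix in (26), and `W`, standing for $\hat\Omega_{11}$ and assumed positive definite. The names `M`, `W`, and `liml_two_endogenous` are illustrative and not from the text.

```python
import math


def liml_two_endogenous(M, W):
    """Smallest root nu of |M - nu W| = 0 and beta_star with (M - nu W)(1, beta_star)' = 0.

    M and W are symmetric 2 x 2 nested lists; W is assumed positive definite.
    """
    (m11, m12), (_, m22) = M
    (w11, w12), (_, w22) = W
    # |M - nu W| = a nu^2 + b nu + c, with a = |W| > 0
    a = w11 * w22 - w12 * w12
    b = -(m11 * w22 + m22 * w11 - 2.0 * m12 * w12)
    c = m11 * m22 - m12 * m12
    disc = max(b * b - 4.0 * a * c, 0.0)      # guard against tiny negative rounding
    nu = (-b - math.sqrt(disc)) / (2.0 * a)   # smaller of the two real roots
    # Second component equation, the scalar case of (31);
    # assumes m22 - nu * w22 is nonzero.
    beta_star = -(m12 - nu * w12) / (m22 - nu * w22)
    return nu, beta_star


M = [[2.0, 1.0], [1.0, 3.0]]
W = [[1.0, 0.2], [0.2, 1.0]]
nu, beta_star = liml_two_endogenous(M, W)
# The dropped first component equation should hold automatically.
first_equation = (M[0][0] - nu * W[0][0]) + (M[0][1] - nu * W[0][1]) * beta_star
```

For larger $G_1$ the same computation is a generalized symmetric eigenvalue problem, for which a numerical linear algebra routine would be the natural tool.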


12.8.5. Relation to the Linear Functional Relationship 


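We now show that the model for the single linear functional relationship ($q = 1$) is identical to the model for structural equations in the special case that $G_2 = 0$ ($y_t^{(1)} = y_t$) and $z_t^{(1)} = 1$ ($K_1 = 1$). Write the two models as

(32) $X_{\alpha j} = \mu + \nu_\alpha + U_{\alpha j}$, $\alpha = 1, \dots, n$, $j = 1, \dots, k$,

where

(33) $\sum_{\alpha=1}^{n} \nu_\alpha = 0$,

and

(34) $y_t = \Pi_1 + \Pi_2 z_t^{(2)} + v_t$, $t = 1, \dots, T$,

where $\Pi = (\Pi_1\ \Pi_2)$. The correspondence between the models is $p \leftrightarrow G = G_1$,

(35) $X_{\alpha j} \leftrightarrow y_t$, $U_{\alpha j} \leftrightarrow v_t$,

(36) $(\alpha, j) \leftrightarrow t$, $nk \leftrightarrow T$,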
(37) won, pe Ш. 


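We can write the model for the linear functional relationship with dummy variables. Define

(38) $S_{\alpha j} = (0, \dots, 0, 1, 0, \dots, 0)'$, with the 1 in the $\alpha$th position, $\alpha = 1, \dots, n - 1$,

(39) $S_{nj} = (-1, \dots, -1)'$.

Then

(40) $\mu + \nu_\alpha = (\mu\ \ \nu_1\ \cdots\ \nu_{n-1}) \begin{pmatrix} 1 \\ S_{\alpha j} \end{pmatrix}$, $\alpha = 1, \dots, n$,

where $j$ may be suppressed. Note

(41) $\nu_n = -(\nu_1 + \cdots + \nu_{n-1})$.

The correspondence is

(42) $1 \leftrightarrow z_t^{(1)}$, $S_{\alpha j} \leftrightarrow z_t^{(2)}$,

(43) $\mu \leftrightarrow \Pi_1$, $(\nu_1, \dots, \nu_{n-1}) \leftrightarrow \Pi_2$,

(44) $1 \leftrightarrow K_1$, $n - 1 \leftrightarrow K_2$,

(45) $B(\nu_1, \dots, \nu_{n-1}) = 0 \leftrightarrow \beta' \Pi_2 = 0$.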


Let $P = (P_1\ P_2)$. In terms of the statistics we have the correspondence

(46) $\hat\mu = \bar{x} \leftrightarrow P_1$,

(47) $\hat\nu_\alpha = \bar{x}_\alpha - \bar{x} \leftrightarrow P_2$.


The effect matrix is

(48) $H = k \sum_{\alpha=1}^{n} (\bar{x}_\alpha - \bar{x})(\bar{x}_\alpha - \bar{x})' \leftrightarrow P_2 A_{22\cdot 1} P_2'$,

and the error matrix is

(49) $G = \sum_{\alpha=1}^{n} \sum_{j=1}^{k} (x_{\alpha j} - \bar{x}_\alpha)(x_{\alpha j} - \bar{x}_\alpha)' \leftrightarrow T\hat\Omega = \sum_{t=1}^{T} (y_t - P z_t)(y_t - P z_t)'$.

Then the estimator $\hat\beta$ of the linear functional relationship for $q = 1$ is identical to the LIML estimator [Anderson (1951b), (1976), (1984a)].


12.8.6. Asymptotic Theory as $T \to \infty$


We shall find the limiting distribution of $\sqrt{T}(\hat\beta^* - \beta^*)$ defined by (28) and (31) by showing that $\hat\beta^*$ is asymptotically equivalent to

(50) $\hat\beta^*_{\mathrm{TSLS}} = -(P_{12}^* S_{22\cdot 1} P_{12}^{*\prime})^{-1} P_{12}^* S_{22\cdot 1} p_{12}$.

This derivation is essentially the same as that given in Anderson and Rubin (1950) except for notation. The estimator defined by (50), known as the two-stage least squares (TSLS) estimator, is an approximation to the LIML estimator obtained by dropping the terms $\nu \hat\Omega_{11}^*$ and $\nu \hat\omega_{(1)}$ from (31). Let $\hat\beta^* = \hat\beta^*_{\mathrm{LIML}}$. We assume the conditions for $\sqrt{T}(P - \Pi)$ having a limiting normal distribution. (See Theorem 8.11.1.)


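Lemma 12.8.1. Suppose $(1/T)A \to A^0$, a positive definite matrix, as $T \to \infty$. Then $\nu = O_p(1/T)$, where $\nu$ is the smallest root of (26).


Proof. Let $\tilde P_{12} = \sqrt{T}(P_{12} - \Pi_{12})$. Then because $\beta' \Pi_{12} = 0$,

(51) $\frac{\beta' P_{12} S_{22\cdot 1} P_{12}' \beta}{\beta' \hat\Omega_{11} \beta} = \frac{\beta' \tilde P_{12} S_{22\cdot 1} \tilde P_{12}' \beta}{T \beta' \hat\Omega_{11} \beta} = O_p\left(\frac{1}{T}\right)$.

Since

(52) $\nu = \min_{b} \frac{b' P_{12} S_{22\cdot 1} P_{12}' b}{b' \hat\Omega_{11} b} \le \frac{\beta' P_{12} S_{22\cdot 1} P_{12}' \beta}{\beta' \hat\Omega_{11} \beta}$,

the lemma follows. ∎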
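The import of Lemma 12.8.1 is that the difference between the LIML estimator and the TSLS estimator is $O_p(1/T)$. We have

(53) $\hat\beta^*_{\mathrm{LIML}} - \hat\beta^*_{\mathrm{TSLS}}$

$= (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime})^{-1} P_{12}^* S_{22\cdot 1} p_{12} - (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime} - \nu \hat\Omega_{11}^*)^{-1} (P_{12}^* S_{22\cdot 1} p_{12} - \nu \hat\omega_{(1)})$

$= [(P_{12}^* S_{22\cdot 1} P_{12}^{*\prime})^{-1} - (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime} - \nu \hat\Omega_{11}^*)^{-1}] P_{12}^* S_{22\cdot 1} p_{12} + (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime} - \nu \hat\Omega_{11}^*)^{-1} \nu \hat\omega_{(1)}$

$= -\nu (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime})^{-1} \hat\Omega_{11}^* (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime} - \nu \hat\Omega_{11}^*)^{-1} P_{12}^* S_{22\cdot 1} p_{12} + \nu (P_{12}^* S_{22\cdot 1} P_{12}^{*\prime} - \nu \hat\Omega_{11}^*)^{-1} \hat\omega_{(1)}$

$= O_p(\nu) = O_p(1/T)$.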


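Consider

(54) $p_{12} + P_{12}^{*\prime} \beta^* = P_{12}' \beta = A_{22\cdot 1}^{-1} \sum_{t=1}^{T} z_t^{(2\cdot 1)} y_t^{(1)\prime} \beta = A_{22\cdot 1}^{-1} \sum_{t=1}^{T} z_t^{(2\cdot 1)} u_{1t}$,

where $z_t^{(2\cdot 1)} = z_t^{(2)} - A_{21} A_{11}^{-1} z_t^{(1)}$. Thus $\mathcal{E}(p_{12} + P_{12}^{*\prime} \beta^*) = \mathcal{E} P_{12}' \beta = 0$ and

(55) $\mathcal{E}(p_{12} + P_{12}^{*\prime} \beta^*)(p_{12} + P_{12}^{*\prime} \beta^*)' = \mathcal{E} P_{12}' \beta (P_{12}' \beta)' = \sigma^2 A_{22\cdot 1}^{-1}$.

Note that $\hat\beta^*_{\mathrm{TSLS}} - \beta^* = -(P_{12}^* S_{22\cdot 1} P_{12}^{*\prime})^{-1} P_{12}^* S_{22\cdot 1} P_{12}' \beta$ and $(\beta', 0) y_t + (\gamma', 0) z_t = u_{1t}$.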


Theorem 12.8.1. Under the conditions of Theorem 8.11.1,

(56) $\sqrt{T}(\hat\beta^*_{\mathrm{LIML}} - \beta^*) \xrightarrow{d} N\left[0, \sigma^2 (\Pi_{12}^* S^0_{22\cdot 1} \Pi_{12}^{*\prime})^{-1}\right]$.


Proof. The theorem follows from (55), $S_{22\cdot 1} \to S^0_{22\cdot 1}$, and $P_{12} \xrightarrow{p} \Pi_{12}$. ∎

Because of the correspondence between the LIML estimator and the maximum likelihood estimator for the linear functional relationship as outlined in Section 12.7.5, this asymptotic theory can be translated for the latter.
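Suppose the single linear functional relationship is written as

(57) $0 = \beta' \nu_\alpha = (1\ \ \beta^{*\prime}) \begin{pmatrix} \nu_{1\alpha} \\ \nu_\alpha^* \end{pmatrix} = \nu_{1\alpha} + \beta^{*\prime} \nu_\alpha^*$, $\alpha = 1, \dots, n$,

where

(58) $\nu_\alpha = \begin{pmatrix} \nu_{1\alpha} \\ \nu_\alpha^* \end{pmatrix}$, $\alpha = 1, \dots, n$.

Let $n$ ($= K$) be fixed, and let the number of replications $k \to \infty$ (corresponding to $T/K \to \infty$ for fixed $K$). Let $\sigma^2 = \beta' \Psi \beta$.

Since $\Pi_{12}^* A_{22\cdot 1} \Pi_{12}^{*\prime}$ corresponds to $k \sum_{\alpha=1}^{n} \nu_\alpha^* \nu_\alpha^{*\prime}$, $\hat\beta^*$ here has the approximate distribution

(59) $N\left[\beta^*, \sigma^2 \left( k \sum_{\alpha=1}^{n} \nu_\alpha^* \nu_\alpha^{*\prime} \right)^{-1}\right]$.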


Although Anderson and Rubin (1950) showed that $\nu \hat\Omega_{11}^*$ and $\nu \hat\omega_{(1)}$ could be dropped from (31) defining $\hat\beta^*_{\mathrm{LIML}}$ and hence that $\hat\beta^*_{\mathrm{TSLS}}$ was asymptotically equivalent to $\hat\beta^*_{\mathrm{LIML}}$, they did not explicitly propose $\hat\beta^*_{\mathrm{TSLS}}$. [As part of the Cowles Commission program, Chernoff and Divinsky (1953) developed a computational program of $\hat\beta_{\mathrm{LIML}}$.] The TSLS estimator was proposed by Basmann (1957) and Theil (1961). It corresponds in the linear functional relationship setup to ordinary least squares on the first coordinate. If some other coefficient of $\beta$ were set equal to one, the minimization would be in the direction of that coordinate.

Consider the general linear functional relationship when the error covari- 
ance matrix is unknown and there are replications. Constrain B to be 


(60) $B = (I_q\ \ B^*)$.
Partition 

yo 
6 = a 
(61) va M 


Then the least squares estimator of B* is 


(62) В. =- Y ZO — ROD) zO xy : gO — gO M gO — ZO) 7 
al a X a ) У ( а (= x ) 


а= | 




For $n$ fixed and $k \to \infty$, $\hat B^*_{\mathrm{LS}} \xrightarrow{p} B^*$ and


n =I 
} (63) feft) | У ese Т4 


а= 1 


[See Anderson (1984b).] It was shown by Anderson (1951c) that the $q$ smallest sample roots are of such a probability order that the maximum likelihood estimator is asymptotically equivalent, that is, the limiting distribution of $\sqrt{k}\,\mathrm{vec}(\hat B^*_{\mathrm{ML}} - B^*)$ is the right-hand side of (63).


12.8.7. Other Asymptotic Theory 


In terms of the linear functional relationship it may be more natural to consider $n \to \infty$ and $k$ fixed. When $k = 1$ and the error covariance matrix is $\sigma^2 I_p$, Gleser (1981) has given the asymptotic theory. For the simultaneous equations model, the corresponding conditions are that $K_2 \to \infty$, $T \to \infty$, and $K_2/T$ approaches a positive limit. Kunitomo (1980) has given an asymptotic expansion of the distribution in the case of $p = 2$ and $m = q = 1$.

When $n \to \infty$, the least squares estimator (i.e., minimizing the sum of squares of the residuals in one fixed direction) is not consistent; the LIML and TSLS estimators are not asymptotically equivalent.


12.8.8. Distributions of Estimators 


Econometricians have studied intensively the distributions of the TSLS and LIML estimators, particularly in the case of two endogenous variables.

Exact distributions have been given by Basmann (1961), (1963), Richardson 
(1968), Sawa (1969), Mariano and Sawa (1972), Phillips (1980), and Anderson 
and Sawa (1982). These have not been very informative because they are 
usually given in terms of infinite series the properties of which are unknown 
or irrelevant. 

A more useful approach is by approximating the distributions. Asymptotic 
expansions of distributions have been made by Sargan and Mikhail (1971), 
Anderson and Sawa (1973), Anderson (1974), Kunitomo (1980), and others. 
Phillips (1982) studied the Padé approach. See also Anderson (1977). 

Tables of the distributions of the TSLS and LIML estimators in the case of two endogenous variables have been given by Anderson and Sawa (1977), (1979), and Anderson, Kunitomo, and Sawa (1983a).

Anderson, Kunitomo, and Sawa (1983b) graphed densities of the maxi- 
mum likelihood estimator and the least squares estimator (minimizing in one 
direction) for the linear functional relationship (Section 12.6) for the case 
p=2,m=q=1,V= c^ NI, and for various values of В, n, and


(64) = — È (HTH) 


PROBLEMS 


124. (Sec. 12.2) Let z, 7 zi, 7 l, a 1,..., п, and B = В. Verify that a =F ip. 
Relate this result to the discriminant function (Chapter 6). 


12.2. (Sec. 12.2) Prove that the roots of (14) are real.
12.3. (Sec. 12.2) 


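(a) Let $X' = (X^{(1)\prime}\ X^{(2)\prime})$, $\mathcal{E}X = 0$,

$\mathcal{E}XX' = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$,

$U = \alpha' X^{(1)}$, $V = \gamma' X^{(2)}$, $\mathcal{E}U^2 = 1 = \mathcal{E}V^2$, where $\alpha$ and $\gamma$ are vectors. Show that choosing $\alpha$ and $\gamma$ to maximize $\mathcal{E}UV$ is equivalent to choosing $\alpha$ and $\gamma$ to minimize the generalized variance of $(U\ V)$.

(b) Let $X' = (X^{(1)\prime}\ X^{(2)\prime}\ X^{(3)\prime})$, $\mathcal{E}X = 0$,

$\mathcal{E}XX' = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix}$,

$U = \alpha' X^{(1)}$, $V = \gamma' X^{(2)}$, $W = \beta' X^{(3)}$, $\mathcal{E}U^2 = \mathcal{E}V^2 = \mathcal{E}W^2 = 1$. Consider finding $\alpha$, $\gamma$, $\beta$ to minimize the generalized variance of $(U, V, W)$. Show that this minimum is invariant with respect to transformations $X^{*(i)} = A_i X^{(i)}$, $|A_i| \ne 0$.

(c) By using such transformations, transform $\Sigma$ into the simplest possible form.

(d) In the case of $X^{(i)}$ consisting of two components, reduce the problem (of minimizing the generalized variance) to its simplest form.

(e) In this case give the derivative equations.

(f) Show that the minimum generalized variance is 1 if and only if $\Sigma_{12} = 0$, $\Sigma_{13} = 0$, $\Sigma_{23} = 0$. (Note: This extension of the notion of canonical variates does not lend itself to a "nice" explicit treatment.)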
12.4. (Sec. 12.2) Let

$X^{(1)} = AZ + Y^{(1)}$,
$X^{(2)} = BZ + Y^{(2)}$,

where $Y^{(1)}$, $Y^{(2)}$, $Z$ are independent with mean zero and covariance matrices $I$ with appropriate dimensionalities. Let $A = (a_1, \dots, a_k)$, $B = (b_1, \dots, b_k)$, and suppose that $A'A$, $B'B$ are diagonal with positive diagonal elements. Show that the canonical variables for nonzero canonical correlations are proportional to $a_i' X^{(1)}$, $b_i' X^{(2)}$. Obtain the canonical correlation coefficients and appropriate normalizing coefficients for the canonical variables.


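12.5. (Sec. 12.2) Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q > 0$ be the positive roots of (14), where $\Sigma_{11}$ and $\Sigma_{22}$ are $q \times q$ nonsingular matrices.

(a) What is the rank of $\Sigma_{12}$?

(b) Write $\prod_{i=1}^{q} \lambda_i^2$ as the determinant of a rational function of $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$, and $\Sigma_{22}$. Justify your answer.

(c) If $\lambda_q = 1$, what is the rank of

$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$?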


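12.6. (Sec. 12.2) Let $\Sigma_{11} = (1 - g) I_{p_1} + g \varepsilon_{p_1} \varepsilon_{p_1}'$, $\Sigma_{22} = (1 - h) I_{p_2} + h \varepsilon_{p_2} \varepsilon_{p_2}'$, $\Sigma_{12} = k \varepsilon_{p_1} \varepsilon_{p_2}'$, where $-1/(p_1 - 1) < g < 1$, $-1/(p_2 - 1) < h < 1$, and $k$ is suitably restricted. Find the canonical correlations and variates. What is the appropriate restriction on $k$?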


12.7. (Sec. 12.3) Find the canonical correlations and canonical variates between 
the first two variables and the last three in Problem 4.42. 


12.8. (Sec. 12.3) Prove directly the sample analog of Theorem 12.2.1. 


12.9. (Sec. 12.3) Prove that A?(i + 1) > A? and a(i 1) > a if о (0) is such that 
а '(0)5 оа «0. [Hint: Use € 5d Ea = AAA] 


12.10. (Sec. 12.6) Prove (9), (10), and (11). 


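12.11. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q$ be the roots of $|\Sigma_1 - \lambda \Sigma_2| = 0$, where $\Sigma_1$ and $\Sigma_2$ are $q \times q$ positive definite covariance matrices.

(a) What does $\lambda_1 = \lambda_q = 1$ imply about the relationship of $\Sigma_1$ and $\Sigma_2$?

(b) What does $\lambda_q > 1$ imply about the relationship of the ellipsoids $x' \Sigma_1^{-1} x = c$ and $x' \Sigma_2^{-1} x = c$?

(c) What does $\lambda_1 > 1$ and $\lambda_q < 1$ imply about the relationship of the ellipsoids $x' \Sigma_1^{-1} x = c$ and $x' \Sigma_2^{-1} x = c$?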


12.12. (Sec. 12.4) For q= 2 express the criterion (2) of Section 9.5 in terms of 
canonical correlations. 


12.13. Find the canonical correlations for the data in Problem 9.11. 


CHAPTER 13 


The Distributions of Characteristic 
Roots and Vectors 


13.1. INTRODUCTION 


In this chapter we find the distribution of the sample principal component 
vectors and their sample variances when all population variances are 1 
(Section 13.3). We also find the distribution of the sample canonical correla- 
tions and one set of canonical vectors when the two sets of original variates 
are independent. This second distribution will be shown to be equivalent to 
the distribution of roots and vectors obtained in the next section. The 
distribution of the roots is particularly of interest because many invariant 
tests are functions of these roots. For example, invariant tests of the general 
linear hypothesis (Section 8.6) depend on the sample only through the roots 
of the determinantal equation 


(1) $|(\hat\beta_{1\Omega} - \beta_1^*) A_{11\cdot 2} (\hat\beta_{1\Omega} - \beta_1^*)' - l N \hat\Sigma_\Omega| = 0$.


If the hypothesis is true, the roots have the distribution given in Theorem 
13.2.2 or 13.2.3. Thus the significance level of any invariant test of the 
general linear hypothesis can be obtained from the distribution derived in the 
next section. If the test criterion is one of the ordered roots (e.g., the largest 
root), then the desired distribution is a marginal distribution of the joint 
distribution of roots. 

The limiting distributions of the roots are obtained under fairly general 
conditions. These are needed to obtain other limiting distributions, such as 
the distribution of the criterion for testing that the smallest variances of 
principal components are equal. Some limiting distributions are obtained for 
elliptically contoured distributions. 


13.2. THE CASE OF TWO WISHART MATRICES 


13.2.1. The Transformation 


Let us consider $A^*$ and $B^*$ ($p \times p$) distributed independently according to $W(\Sigma, m)$ and $W(\Sigma, n)$ respectively ($m, n \ge p$). We shall call the roots of

(1) $|A^* - lB^*| = 0$

the characteristic roots of $A^*$ in the metric of $B^*$ and the vectors satisfying

(2) $(A^* - lB^*) x^* = 0$

the characteristic vectors of $A^*$ in the metric of $B^*$. In this section we shall consider the distribution of these roots and vectors. Later it will be shown that the squares of canonical correlation coefficients have this distribution if the population canonical correlations are all zero.

First we shall transform $A^*$ and $B^*$ so that the distributions do not involve an arbitrary matrix $\Sigma$. Let $C$ be a matrix such that $C \Sigma C' = I$. Let

(3) $A = C A^* C'$, $B = C B^* C'$.

Then $A$ and $B$ are independently distributed according to $W(I, m)$ and $W(I, n)$ respectively (Section 7.3.3). Since

$|A - lB| = |C A^* C' - l C B^* C'| = |C (A^* - lB^*) C'| = |C| \cdot |A^* - lB^*| \cdot |C'|$,

the roots of (1) are the roots of

(4) $|A - lB| = 0$.

The corresponding vectors satisfying

(5) $(A - lB) x = 0$

satisfy

(6) $0 = C^{-1} (A - lB) x = C^{-1} (C A^* C' - l C B^* C') x = (A^* - lB^*) C' x$.

Thus the vectors $x^*$ are the vectors $C' x$.
It will be convenient to consider the roots of

(7) $|A - f(A + B)| = 0$

and the vectors $y$ satisfying

(8) $[A - f(A + B)] y = 0$.

The latter equation can be written

(9) $0 = (A - fA - fB) y = [(1 - f) A - fB] y$.

Since the probability that $f = 1$ (i.e., that $|-B| = 0$) is 0, the above equation is

(10) $\left( A - \frac{f}{1 - f} B \right) y = 0$.

Thus the roots of (4) are related to the roots of (7) by $l = f/(1 - f)$ or $f = l/(1 + l)$, and the vectors satisfying (5) are equal (or proportional) to those satisfying (8).
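The relation $f = l/(1 + l)$ between the two sets of roots is easy to check numerically. The following sketch uses $2 \times 2$ matrices, for which each determinantal equation is a quadratic; the matrices and the helper name `gen_roots_2x2` are illustrative choices, not taken from the text.

```python
import math


def gen_roots_2x2(A, B):
    """Roots (largest first) of |A - x B| = 0 for symmetric 2 x 2 A and positive definite B."""
    (a11, a12), (_, a22) = A
    (b11, b12), (_, b22) = B
    qa = b11 * b22 - b12 * b12                      # |B| > 0
    qb = -(a11 * b22 + a22 * b11 - 2.0 * a12 * b12)
    qc = a11 * a22 - a12 * a12                      # |A|
    root = math.sqrt(max(qb * qb - 4.0 * qa * qc, 0.0))
    return ((-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa))


A = [[4.0, 1.0], [1.0, 3.0]]
B = [[2.0, -0.5], [-0.5, 1.0]]
G = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]   # A + B

l_roots = gen_roots_2x2(A, B)   # roots of |A - l B| = 0
f_roots = gen_roots_2x2(A, G)   # roots of |A - f (A + B)| = 0
mapped = tuple(l / (1.0 + l) for l in l_roots)
```

Because $l \mapsto l/(1 + l)$ is increasing for $l > -1$, the ordering of the roots is preserved, so `mapped` can be compared with `f_roots` term by term.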

We now consider finding the distribution of the roots and vectors satisfying (7) and (8). Let the roots be ordered $f_1 > f_2 > \cdots > f_p > 0$ since the probability of two roots being equal is 0 [Okamoto (1973)]. Let

(11) $F = \begin{pmatrix} f_1 & 0 & \cdots & 0 \\ 0 & f_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & f_p \end{pmatrix}$.

Suppose the corresponding vector solutions of (8) normalized by

(12) $y'(A + B) y = 1$

are $y_1, \dots, y_p$. These vectors must satisfy

(13) $y_i'(A + B) y_j = 0$, $i \ne j$,

because $y_i' A y_j = f_j y_i'(A + B) y_j$ and $y_i' A y_j = f_i y_i'(A + B) y_j$, and this can be only if (13) holds ($f_i \ne f_j$).

Let the $p \times p$ matrix $Y$ be

(14) $Y = (y_1, \dots, y_p)$.
Equation (8) can be summarized as

(15) $AY = (A + B) Y F$,

and (12) and (13) give

(16) $Y'(A + B) Y = I$.

From (15) we have

(17) $Y' A Y = Y'(A + B) Y F = F$.

Multiplication of (16) and (17) on the left by $(Y')^{-1}$ and on the right by $Y^{-1}$ gives

(18) $A + B = (Y')^{-1} Y^{-1}$, $A = (Y')^{-1} F Y^{-1}$.

Now let $Y^{-1} = E$. Then

(19) $A + B = E'E$, $A = E'FE$, $B = E'(I - F)E$.

We now consider the joint distribution of $E$ and $F$. From (19) we see that $E$ and $F$ define $A$ and $B$ uniquely. From (7) and (11) and the ordering $f_1 > \cdots > f_p$ we see that $A$ and $B$ define $F$ uniquely. Equations (8) for $f = f_i$ and (12) define $y_i$ uniquely except for multiplication by $-1$ (i.e., replacing $y_i$ by $-y_i$). Since $YE = I$, this means that $E$ is defined uniquely except that rows of $E$ can be multiplied by $-1$. To remove this indeterminacy we require that $e_{i1} \ge 0$. (The probability that $e_{i1} = 0$ is 0.) Thus $E$ and $F$ are uniquely defined in terms of $A$ and $B$.
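The construction of $E$ and $F$ from $A$ and $B$ can be traced numerically. The sketch below handles only the $2 \times 2$ case, where the roots come from a quadratic and each vector can be read off one row of $A - f(A + B)$; the helper names and the example matrices are assumptions made for illustration.

```python
import math


def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(X):
    return [[X[j][i] for j in range(2)] for i in range(2)]


def decompose(A, B):
    """Return (E, F) with A = E'FE and B = E'(I - F)E for 2 x 2 positive definite A, B."""
    G = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
    # Roots of |A - f G| = 0 from the quadratic qa f^2 + qb f + qc = 0
    qa = G[0][0] * G[1][1] - G[0][1] ** 2
    qb = -(A[0][0] * G[1][1] + A[1][1] * G[0][0] - 2.0 * A[0][1] * G[0][1])
    qc = A[0][0] * A[1][1] - A[0][1] ** 2
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    fs = [(-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)]   # f1 > f2
    cols = []
    for f in fs:
        # First row of (A - f G) y = 0 gives y up to scale (assumes that row is nonzero)
        y = [-(A[0][1] - f * G[0][1]), A[0][0] - f * G[0][0]]
        norm = math.sqrt(sum(y[i] * G[i][j] * y[j] for i in range(2) for j in range(2)))
        cols.append([v / norm for v in y])           # now y'(A + B)y = 1
    Y = transpose(cols)                              # columns are y1, y2
    det = Y[0][0] * Y[1][1] - Y[0][1] * Y[1][0]
    E = [[Y[1][1] / det, -Y[0][1] / det], [-Y[1][0] / det, Y[0][0] / det]]   # E = Y^{-1}
    # Fix signs so that the first column of E is nonnegative
    E = [[-v for v in row] if row[0] < 0 else row for row in E]
    F = [[fs[0], 0.0], [0.0, fs[1]]]
    return E, F


A = [[4.0, 1.0], [1.0, 3.0]]
B = [[2.0, -0.5], [-0.5, 1.0]]
E, F = decompose(A, B)
A_back = mat_mul(mat_mul(transpose(E), F), E)
I_minus_F = [[1.0 - F[0][0], 0.0], [0.0, 1.0 - F[1][1]]]
B_back = mat_mul(mat_mul(transpose(E), I_minus_F), E)
```

Changing the sign of a row of $E$ leaves $E'FE$ unchanged because $F$ is diagonal, which is why the sign convention can be imposed after the inversion.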


13.2.2. The Jacobian 


To find the density of $E$ and $F$ we substitute in the density of $A$ and $B$ according to (19) and multiply by the Jacobian of the transformation. We devote this subsection to finding the Jacobian

(20) $\left| \frac{\partial(A, B)}{\partial(E, F)} \right|$.

Since the transformation from $A$ and $B$ to $A$ and $G = A + B$ has Jacobian unity, we shall find

(21) $\left| \frac{\partial(A, G)}{\partial(E, F)} \right| = \left| \frac{\partial(A, B)}{\partial(E, F)} \right|$.
First we notice that if $x_\alpha = f_\alpha(y_1, \dots, y_n)$, $\alpha = 1, \dots, n$, is a one-to-one transformation, the Jacobian is the determinant of the linear transformation

(22) $dx_\alpha = \sum_{\beta=1}^{n} \frac{\partial f_\alpha}{\partial y_\beta}\, dy_\beta$,

where $dx_\alpha$ and $dy_\beta$ are only formally differentials (i.e., we write these as a mnemonic device). If $f_\alpha(y_1, \dots, y_n)$ is a polynomial, then $\partial f_\alpha / \partial y_\beta$ is the coefficient of $y_\beta^*$ in the expansion of $f_\alpha(y_1 + y_1^*, \dots, y_n + y_n^*)$ [in fact the coefficient in the expansion of $f_\alpha(y_1, \dots, y_{\beta-1}, y_\beta + y_\beta^*, y_{\beta+1}, \dots, y_n)$]. The elements of $A$ and $G$ are polynomials in $E$ and $F$. Thus the derivative of an element of $A$ is the coefficient of an element of $E^*$ and $F^*$ in the expansion of $(E + E^*)'(F + F^*)(E + E^*)$ and the derivative of an element of $G$ is the coefficient of an element of $E^*$ and $F^*$ in the expansion of $(E + E^*)'(E + E^*)$. Thus the Jacobian of the transformation from $A, G$ to $E, F$ is the determinant of the linear transformation

(23) $dA = (dE)'FE + E'(dF)E + E'F(dE)$,

(24) $dG = (dE)'E + E'(dE)$.

Since $A$ and $G$ ($dA$ and $dG$) are symmetric, only the functionally independent component equations above are used.


Multiply (23) and (24) on the left by $(E')^{-1}$ and on the right by $E^{-1}$ to obtain

(25) $(E')^{-1}(dA)E^{-1} = (E')^{-1}(dE)'F + dF + F(dE)E^{-1}$,

(26) $(E')^{-1}(dG)E^{-1} = (E')^{-1}(dE)' + (dE)E^{-1}$.

It should be kept in mind that (23) and (24) are now considered as a linear transformation without regard to how the equations were obtained.

Let

(27) $(E')^{-1}(dA)E^{-1} = d\bar{A}$,

(28) $(E')^{-1}(dG)E^{-1} = d\bar{G}$,

(29) $(dE)E^{-1} = dW$.

Then

(30) $d\bar{A} = (dW)'F + dF + F(dW)$,

(31) $d\bar{G} = (dW)' + dW$.
The linear transformation from $dE, dF$ to $dA, dG$ is considered as the linear transformation from $dE, dF$ to $dW, dF$ with determinant $|E^{-1}|^p = |E|^{-p}$ (because each row of $dE$ is transformed by $E^{-1}$), followed by the linear transformation from $dW, dF$ to $d\bar{A}, d\bar{G}$, followed by the linear transformation from $d\bar{A}, d\bar{G}$ to $dA = E'(d\bar{A})E$, $dG = E'(d\bar{G})E$ with determinant $|E|^{p+1} \cdot |E|^{p+1}$ (from Section 7.3.3); and the determinant of the linear transformation from $dE, dF$ to $dA, dG$ is the product of the determinants of the three component transformations. The transformation (30), (31) is written in components as

(32) $d\bar{a}_{ii} = df_i + 2 f_i\, dw_{ii}$,
     $d\bar{a}_{ij} = f_j\, dw_{ji} + f_i\, dw_{ij}$, $i < j$,
     $d\bar{g}_{ii} = 2\, dw_{ii}$,
     $d\bar{g}_{ij} = dw_{ji} + dw_{ij}$, $i < j$.


The determinant is

(33) $\begin{array}{c|cccc} & df_i & dw_{ii} & dw_{ij}\ (i<j) & dw_{ij}\ (i>j) \\ \hline d\bar{a}_{ii} & I & 2F & 0 & 0 \\ d\bar{g}_{ii} & 0 & 2I & 0 & 0 \\ d\bar{a}_{ij}\ (i<j) & 0 & 0 & M & N \\ d\bar{g}_{ij}\ (i<j) & 0 & 0 & I & I \end{array} = \begin{vmatrix} I & 2F \\ 0 & 2I \end{vmatrix} \cdot \begin{vmatrix} M & N \\ I & I \end{vmatrix} = 2^p |M - N|$,

where $M$ and $N$ are diagonal matrices of order $\frac{1}{2}p(p-1)$ whose rows correspond to $d\bar{a}_{12}, \dots, d\bar{a}_{1p}, d\bar{a}_{23}, \dots, d\bar{a}_{p-1,p}$:

(34) $M = \operatorname{diag}(f_1, \dots, f_1, f_2, \dots, f_2, \dots, f_{p-1})$,

with columns corresponding to $dw_{12}, \dots, dw_{1p}, dw_{23}, \dots, dw_{p-1,p}$, and

(35) $N = \operatorname{diag}(f_2, \dots, f_p, f_3, \dots, f_p, \dots, f_p)$,

with columns corresponding to $dw_{21}, \dots, dw_{p1}, dw_{32}, \dots, dw_{p,p-1}$. Then

(36) $|M - N| = \prod_{i<j} (f_i - f_j)$.

The determinant of the linear transformation (23), (24) is

(37) $|E|^{-p}\, |E|^{p+1}\, |E|^{p+1}\, 2^p \prod_{i<j} (f_i - f_j) = 2^p |E|^{p+2} \prod_{i<j} (f_i - f_j)$.


Theorem 13.2.1. The Jacobian of the transformation (19) is the absolute 
value of (37). 
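As a quick check of Theorem 13.2.1 (a worked special case added here, not part of the original derivation), take $p = 1$, so that $E = e$ and $F = f$ are scalars and (19) reads $a = e^2 f$, $b = e^2 (1 - f)$. Differentiating directly gives:

```latex
\[
\frac{\partial(a, b)}{\partial(e, f)}
= \begin{vmatrix} 2ef & e^{2} \\ 2e(1 - f) & -e^{2} \end{vmatrix}
= -2e^{3} f - 2e^{3}(1 - f)
= -2e^{3}.
\]
```

The absolute value $2|e|^3$ agrees with $2^p |E|^{p+2} \prod_{i<j}(f_i - f_j)$ at $p = 1$, where the product over $i < j$ is empty and equal to 1.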


13.2.3. The Joint Distribution of the Matrix E and the Roots 

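The joint density of $A$ and $B$ is

(38) $w(A \mid I, m)\, w(B \mid I, n) = C_1 |A|^{\frac{1}{2}(m-p-1)} |B|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr}(A+B)}$,

where

(39) $C_1 = \left[ 2^{\frac{1}{2}p(m+n)}\, \Gamma_p(\tfrac{1}{2}m)\, \Gamma_p(\tfrac{1}{2}n) \right]^{-1}$.

Therefore the joint density of $E$ and $F$ is

(40) $C_1 |E'FE|^{\frac{1}{2}(m-p-1)} |E'(I - F)E|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr} E'E} \cdot 2^p |E'E|^{\frac{1}{2}(p+2)} \prod_{i<j} (f_i - f_j)$.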
Since $|E'FE| = |E'| \cdot |F| \cdot |E| = |F| \cdot |E'E| = \prod_{i=1}^{p} f_i\, |E'E|$ and $|E'(I - F)E| = |I - F| \cdot |E'E| = \prod_{i=1}^{p} (1 - f_i)\, |E'E|$, the density of $E$ and $F$ is

(41) $2^p C_1 |E'E|^{\frac{1}{2}(m+n-p)} e^{-\frac{1}{2}\operatorname{tr} E'E} \prod_{i=1}^{p} f_i^{\frac{1}{2}(m-p-1)} \prod_{i=1}^{p} (1 - f_i)^{\frac{1}{2}(n-p-1)} \prod_{i<j} (f_i - f_j)$.

Clearly, $E$ and $F$ are statistically independent because the density factors into a function of $E$ and a function of $F$. To determine the marginal densities we have only to find the two normalizing constants (the product of which is $2^p C_1$).


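Let us evaluate

(42) $2^p \int |E'E|^{\frac{1}{2}(m+n-p)} e^{-\frac{1}{2}\operatorname{tr} E'E}\, dE$,

where the integration is $0 < e_{i1} < \infty$, $-\infty < e_{ij} < \infty$, $j \ne 1$. The value of (42) is unchanged if we let $-\infty < e_{i1} < \infty$ and multiply by $2^{-p}$. Thus (42) is

(43) $(2\pi)^{\frac{1}{2}p^2} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} |E'E|^{\frac{1}{2}(m+n-p)} \left[ \frac{e^{-\frac{1}{2}\operatorname{tr} E'E}}{(2\pi)^{\frac{1}{2}p^2}} \right] \prod_{i,j} de_{ij}$.

Except for the constant $(2\pi)^{\frac{1}{2}p^2}$, (43) is a definition of the expectation of the $\frac{1}{2}(m + n - p)$th power of $|E'E|$ when the $e_{ij}$ have as density the function within brackets. This expected value is the $\frac{1}{2}(m + n - p)$th moment of the generalized variance $|E'E|$ when $E'E$ has the distribution $W(I, p)$. (See Section 7.5.) Thus (43) is

(44) $(2\pi)^{\frac{1}{2}p^2}\, 2^{\frac{1}{2}p(m+n-p)}\, \frac{\Gamma_p[\frac{1}{2}(m+n)]}{\Gamma_p(\frac{1}{2}p)}$.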


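Thus the density of $E$ is

(45) $\frac{\Gamma_p(\frac{1}{2}p)\, |E'E|^{\frac{1}{2}(m+n-p)} e^{-\frac{1}{2}\operatorname{tr} E'E}}{2^{\frac{1}{2}p(m+n-2)}\, \pi^{\frac{1}{2}p^2}\, \Gamma_p[\frac{1}{2}(m+n)]}$.

The density of $f_i$ is (41) divided by (45); that is, the density of $f_i$ is

(46) $C_2 \prod_{i=1}^{p} f_i^{\frac{1}{2}(m-p-1)} \prod_{i=1}^{p} (1 - f_i)^{\frac{1}{2}(n-p-1)} \prod_{i<j} (f_i - f_j)$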
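for $0 < f_p < \cdots < f_1 < 1$, where

(47) $C_2 = \frac{\pi^{\frac{1}{2}p^2}\, \Gamma_p[\frac{1}{2}(m+n)]}{\Gamma_p(\frac{1}{2}n)\, \Gamma_p(\frac{1}{2}m)\, \Gamma_p(\frac{1}{2}p)}$.

The density of $l_i$ is obtained from (46) by letting

(48) $f_i = \frac{l_i}{l_i + 1}$;

we have

(49) $\frac{df_i}{dl_i} = \frac{1}{(l_i + 1)^2}$, $f_i - f_j = \frac{l_i - l_j}{(l_i + 1)(l_j + 1)}$, $1 - f_i = \frac{1}{l_i + 1}$.

Thus the density of $l_i$ is

(50) $C_2 \prod_{i=1}^{p} l_i^{\frac{1}{2}(m-p-1)} \prod_{i=1}^{p} (l_i + 1)^{-\frac{1}{2}(m+n)} \prod_{i<j} (l_i - l_j)$

for $0 < l_p < \cdots < l_1$.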


Theorem 13.2.2. If $A$ and $B$ are distributed independently according to $W(\Sigma, m)$ and $W(\Sigma, n)$ respectively ($m \ge p$, $n \ge p$), the joint density of the roots of $|A - lB| = 0$ is (50) where $C_2$ is defined by (47).


The joint density of $Y$ can be found from (45) and the fact that the Jacobian is $|Y|^{-2p}$. (See Theorem A.4.6 of the Appendix.)
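A worked special case of Theorem 13.2.2 (added here as a check, not part of the original argument): for $p = 1$ the product over $i < j$ is empty and $\Gamma_1(t) = \Gamma(t)$, so the constant (47) and the densities (46) and (50) reduce to familiar univariate forms.

```latex
\[
C_2 = \frac{\pi^{1/2}\,\Gamma[\tfrac12(m+n)]}{\Gamma(\tfrac12 n)\,\Gamma(\tfrac12 m)\,\Gamma(\tfrac12)}
    = \frac{\Gamma[\tfrac12(m+n)]}{\Gamma(\tfrac12 m)\,\Gamma(\tfrac12 n)},
\]
\[
\text{density of } f:\quad C_2\, f^{\frac12 m - 1}(1 - f)^{\frac12 n - 1}, \qquad 0 < f < 1,
\]
\[
\text{density of } l = \frac{f}{1 - f}:\quad C_2\, l^{\frac12 m - 1}(1 + l)^{-\frac12(m+n)}, \qquad 0 < l < \infty .
\]
```

The first is the beta density with parameters $\frac12 m$ and $\frac12 n$, and the second is the density of the ratio of independent $\chi^2$ variables with $m$ and $n$ degrees of freedom, as expected when $A$ and $B$ are scalars.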


13.2.4. The Distribution for A Singular


The matrix $A$ above can be represented as $A = W_1 W_1'$, where the columns of $W_1$ ($p \times m$) are independently distributed, each according to $N(0, \Sigma)$. We now treat the case of $m < p$. If we let $B + W_1 W_1' = G = CC'$ and $W_1 = CU$, then the roots of

(51) $0 = |A - f(A + B)| = |W_1 W_1' - fG| = |CUU'C' - fCC'| = |C| \cdot |UU' - fI_p| \cdot |C'|$

are the roots of

(52) $|UU' - fI_p| = 0$.

We shall show that the nonzero roots $f_1 > \cdots > f_m$ (these roots being distinct with probability 1) are the roots of

(53) $|U'U - fI_m| = 0$.

For each root $f \ne 0$ of (52) there is a vector $x$ satisfying

(54) $(UU' - fI_p) x = 0$.

Multiplication by $U'$ on the left gives

(55) $0 = U'(UU' - fI_p) x = (U'U - fI_m)(U'x)$.

Thus $U'x$ is a characteristic vector of $U'U$ and $f$ is the corresponding root.
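The claim that the nonzero roots of $UU'$ coincide with the roots of $U'U$ can be seen in the smallest case $p = 2$, $m = 1$. The sketch below uses an arbitrary illustrative vector `u`; the helper name `sym_eigs_2x2` is an assumption, not notation from the text.

```python
import math


def sym_eigs_2x2(S):
    """Characteristic roots (largest first) of a symmetric 2 x 2 matrix."""
    tr = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    root = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return ((tr + root) / 2.0, (tr - root) / 2.0)


u = [0.6, -0.3]                                             # U is 2 x 1 (p = 2, m = 1)
UUt = [[u[i] * u[j] for j in range(2)] for i in range(2)]   # p x p matrix of rank 1
UtU = u[0] ** 2 + u[1] ** 2                                 # m x m, a scalar here
roots = sym_eigs_2x2(UUt)
```

Here `roots[0]` should equal `UtU` and `roots[1]` should be zero up to rounding, matching the single nonzero root of (52) with the root of (53).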
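It was shown in Section 8.4 that the density of $U = U_1$ is (for $I_p - UU'$ positive definite or $I_m - U'U$ positive definite)

(56) $K |I_p - UU'|^{\frac{1}{2}(n-p-1)} = K |I_m - U'U|^{\frac{1}{2}(n^*-p^*-1)}$,

where $p^* = m$, $n^* - p^* - 1 = n - p - 1$, and $m^* = p$. Thus $f_1, \dots, f_m$ must be distributed according to (46) with $p$ replaced by $m$, $m$ by $p$, and $n$ by $n + m - p$, that is,

(57) $\frac{\pi^{\frac{1}{2}m^2}\, \Gamma_m[\frac{1}{2}(m+n)]}{\Gamma_m(\frac{1}{2}m)\, \Gamma_m[\frac{1}{2}(m+n-p)]\, \Gamma_m(\frac{1}{2}p)} \prod_{i=1}^{m} \left[ f_i^{\frac{1}{2}(p-m-1)} (1 - f_i)^{\frac{1}{2}(n-p-1)} \right] \prod_{i<j} (f_i - f_j)$.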


Theorem 13.2.3. If $A$ is distributed as $W_1 W_1'$, where the $m$ columns of $W_1$ are independent, each distributed according to $N(0, \Sigma)$, $m < p$, and $B$ is independently distributed according to $W(\Sigma, n)$, $n \ge p$, then the density of the nonzero roots of $|A - f(A + B)| = 0$ is given by (57).


These distributions of roots were found independently and at about the same time by Fisher (1939), Girshick (1939), Hsu (1939a), Mood (1951), and Roy (1939). The development of the Jacobian in Section 13.2.2 is due mainly to Hsu [as reported by Deemer and Olkin (1951)].
13.3. THE CASE OF ONE NONSINGULAR WISHART MATRIX

In this section we shall find the distribution of the roots of

(1) $|A - lI| = 0$,

where the matrix $A$ has the distribution $W(I, n)$. It will be observed that the variances of the principal components of a sample of $n + 1$ from $N(\mu, I)$ are $1/n$ times the roots of (1). We shall find the following theorem useful:


Theorem 13.3.1. If the symmetric matrix $B$ has a density of the form $g(l_1, \dots, l_p)$, where $l_1 > \cdots > l_p$ are the characteristic roots of $B$, then the density of the roots is

(2) $\frac{\pi^{\frac{1}{2}p^2}\, g(l_1, \dots, l_p) \prod_{i<j} (l_i - l_j)}{\Gamma_p(\frac{1}{2}p)}$.

Proof. From Theorem A.2.1 of the Appendix we know that there exists an orthogonal matrix $C$ such that

(3) $B = C'LC$,

where

(4) $L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & l_p \end{pmatrix}$.

If the $l$'s are numbered in descending order of magnitude and if $c_{i1} \ge 0$, then (with probability 1) the transformation from $B$ to $L$ and $C$ is unique. Let the matrix $C$ be given the coordinates $c_1, \dots, c_{p(p-1)/2}$ and let the Jacobian of the transformation be $f(L, C)$. Then the joint density of $L$ and $C$ is $g(l_1, \dots, l_p) f(L, C)$. To prove the theorem we must show that

(5) $\int \cdots \int f(L, C)\, dc_1 \cdots dc_{p(p-1)/2} = \frac{\pi^{\frac{1}{2}p^2} \prod_{i<j} (l_i - l_j)}{\Gamma_p(\frac{1}{2}p)}$.

We show this by taking a special case where $B = UU'$ and $U$ ($p \times m$, $m \ge p$) has the density

(6) $\pi^{-\frac{1}{2}pm}\, \frac{\Gamma_p[\frac{1}{2}(m+n)]}{\Gamma_p(\frac{1}{2}n)}\, |I - UU'|^{\frac{1}{2}(n-p-1)}$.
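Then by Lemma 13.3.1, which will be stated below, $B$ has the density

(7) $\frac{\Gamma_p[\frac{1}{2}(m+n)]}{\Gamma_p(\frac{1}{2}m)\, \Gamma_p(\frac{1}{2}n)} |I - B|^{\frac{1}{2}(n-p-1)} |B|^{\frac{1}{2}(m-p-1)} = \frac{\Gamma_p[\frac{1}{2}(m+n)]}{\Gamma_p(\frac{1}{2}m)\, \Gamma_p(\frac{1}{2}n)} \prod_{i=1}^{p} (1 - l_i)^{\frac{1}{2}(n-p-1)} \prod_{i=1}^{p} l_i^{\frac{1}{2}(m-p-1)} = g^*(l_1, \dots, l_p)$.

The joint density of $L$ and $C$ is $f(L, C)\, g^*(l_1, \dots, l_p)$. In the preceding section we proved that the marginal density of $L$ is (46). Thus

(8) $\int \cdots \int g^*(l_1, \dots, l_p) f(L, C)\, dC = g^*(l_1, \dots, l_p) \int \cdots \int f(L, C)\, dC = \frac{\pi^{\frac{1}{2}p^2} \prod_{i<j} (l_i - l_j)}{\Gamma_p(\frac{1}{2}p)}\, g^*(l_1, \dots, l_p)$.

This proves (5) and hence the theorem. ∎

The statement above (7) is based on the following lemma: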


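Lemma 13.3.1. If the density of $Y$ ($p \times m$) is $f(YY')$, then the density of $B = YY'$ is

(9) $\frac{\pi^{\frac{1}{2}pm}\, |B|^{\frac{1}{2}(m-p-1)} f(B)}{\Gamma_p(\frac{1}{2}m)}$.


The proof of this, like that of Theorem 13.3.1, depends on exhibiting a special case; let $f(YY') = (2\pi)^{-\frac{1}{2}pm} e^{-\frac{1}{2}\operatorname{tr} YY'}$; then (9) is $w(B \mid I, m)$.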


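Now let us find the density of the roots of (1). The density of $A$ is

(10) $\frac{|A|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr} A}}{2^{\frac{1}{2}pn}\, \Gamma_p(\frac{1}{2}n)} = \frac{\prod_{i=1}^{p} l_i^{\frac{1}{2}(n-p-1)} \exp\left( -\frac{1}{2} \sum_{i=1}^{p} l_i \right)}{2^{\frac{1}{2}pn}\, \Gamma_p(\frac{1}{2}n)}$.

Thus by the theorem we obtain as the density of the roots of $A$

(11) $\frac{\pi^{\frac{1}{2}p^2} \prod_{i=1}^{p} l_i^{\frac{1}{2}(n-p-1)} \exp\left( -\frac{1}{2} \sum_{i=1}^{p} l_i \right) \prod_{i<j} (l_i - l_j)}{2^{\frac{1}{2}pn}\, \Gamma_p(\frac{1}{2}n)\, \Gamma_p(\frac{1}{2}p)}$.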
Theorem 13.3.2. If $A$ ($p \times p$) has the distribution $W(I, n)$, then the characteristic roots ($l_1 \ge l_2 \ge \cdots \ge l_p \ge 0$) have the density (11) over the range where the density is not 0.


Corollary 13.3.1. Let $v_1 \ge \cdots \ge v_p$ be the sample variances of the sample principal components of a sample of size $N = n + 1$ from $N(\mu, \sigma^2 I)$. Then $(n/\sigma^2) v_i$ are distributed with density (11).
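As a check on (11) (a special case worked out here, not in the original text): with $p = 1$, $A$ is a scalar having the $\chi^2$ distribution with $n$ degrees of freedom, $\Gamma_1(t) = \Gamma(t)$, and the product over $i < j$ is empty, so the density of the single root becomes

```latex
\[
\frac{\pi^{1/2}\, l^{\frac12(n-2)}\, e^{-\frac12 l}}{2^{\frac12 n}\,\Gamma(\tfrac12 n)\,\Gamma(\tfrac12)}
= \frac{l^{\frac12 n - 1}\, e^{-\frac12 l}}{2^{\frac12 n}\,\Gamma(\tfrac12 n)}, \qquad l > 0,
\]
```

which is the $\chi^2_n$ density, using $\Gamma(\tfrac12) = \pi^{1/2}$.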


The characteristic vectors of $A$ are uniquely defined (except for multiplication by $-1$) with probability 1 by

(12) $(A - lI) y = 0$, $y'y = 1$,

since the roots are different with probability 1. Let the vectors with $y_{1i} \ge 0$ be

(13) $Y = (y_1, \dots, y_p)$.

Then

(14) $AY = YL$.

From Section 11.2 we know that

(15) $Y'Y = I$.

Multiplication of (14) on the right by $Y^{-1} = Y'$ gives

(16) $A = YLY'$.

Thus $Y' = C$, defined above.


Now let us consider the joint distribution of $L$ and $C$. The matrix $A$ has the distribution of

(17) $A = \sum_{\alpha=1}^{n} X_\alpha X_\alpha'$,

where the $X_\alpha$ are independently distributed, each according to $N(0, I)$. Let

(18) $X_\alpha^* = Q X_\alpha$,

where $Q$ is any orthogonal matrix. Then the $X_\alpha^*$ are independently distributed according to $N(0, I)$ and

(19) $A^* = \sum_{\alpha=1}^{n} X_\alpha^* X_\alpha^{*\prime} = QAQ'$

is distributed according to $W(I, n)$. The roots of $A^*$ are the roots of $A$; thus

(20) $A^* = C^{**\prime} L C^{**}$,

(21) $C^{**\prime} C^{**} = I$

define $C^{**}$ if we require $c_{i1}^{**} \ge 0$. Let

(22) $C^* = CQ'$.

Let

(23) $J(C^*) = \operatorname{diag}\left( \frac{c_{11}^*}{|c_{11}^*|}, \frac{c_{21}^*}{|c_{21}^*|}, \dots, \frac{c_{p1}^*}{|c_{p1}^*|} \right)$,

with $c_{i1}^*/|c_{i1}^*| = 1$ if $c_{i1}^* = 0$. Thus $J(C^*)$ is a diagonal matrix; the $i$th diagonal element is 1 if $c_{i1}^* \ge 0$ and is $-1$ if $c_{i1}^* < 0$. Thus

(24) $C^{**} = J(C^*) C^* = J(CQ') CQ'$.

The distribution of $C^{**}$ is the same as that of $C$. We now shall show that this fact defines the distribution of $C$.


Definition 13.3.1. If the random orthogonal matrix E of order p has a 
distribution such that EQ' has the same distribution for every orthogonal Q, then 
E is said to have the Haar invariant distribution (or normalized measure). 


The definition is possible because it has been proved that there is only one distribution with the required invariance property [Halmos (1950)]. It has also been shown that this distribution is the only one invariant under multiplication on the left by an orthogonal matrix (i.e., the distribution of $QE$ is the same as that of $E$). From this it follows that the probability is $1/2^p$ that $E$ is such that $e_{i1} \ge 0$. This can be seen as follows. Let $J_1, \dots, J_{2^p}$ be the $2^p$ diagonal matrices with elements $+1$ and $-1$. Since the distribution of $J_i E$ is the same as that of $E$, the probability that $e_{i1} \ge 0$ is the same as the probability that the elements in the first column of $J_i E$ are nonnegative. These events for $i = 1, \dots, 2^p$ are mutually exclusive and exhaustive (except for elements being 0, which have probability 0), and thus the probability of any one is $1/2^p$.

The conditional distribution of $E$ given $e_{i1} \ge 0$ is $2^p$ times the Haar invariant distribution over this part of the space. We shall call it the conditional Haar invariant distribution.


Lemma 13.3.2. If the orthogonal matrix $E$ has a distribution such that $e_{i1} \ge 0$ and if $E^{**} = J(EQ') EQ'$ has the same distribution for every orthogonal $Q$, then $E$ has the conditional Haar invariant distribution.


Proof. Let the space $V$ of orthogonal matrices be partitioned into the subspaces $V_1, \dots, V_{2^p}$ so that $J_i V_i = V_1$, say, where $J_1 = I$ and $V_1$ is the set for which $e_{i1} \ge 0$. Let $\mu_1$ be the measure in $V_1$ defined by the distribution of $E$ assumed in the lemma. The measure $\mu(W)$ of a (measurable) set $W$ in $V_i$ is defined as $(1/2^p) \mu_1(J_i W)$. Now we want to show that $\mu$ is the Haar invariant measure. Let $W$ be any (measurable) set in $V_1$. The lemma assumes that $2^p \mu(W) = \mu_1(W) = \Pr\{E \in W\} = \Pr\{E^{**} \in W\} = \sum_{i} \mu_1(J_i [WQ' \cap V_i]) = 2^p \mu(WQ')$. If $U$ is any (measurable) set in $V$, then $U = \bigcup_{j=1}^{2^p} (U \cap V_j)$. Since $\mu(U \cap V_j) = (1/2^p) \mu_1[J_j (U \cap V_j)]$, by the above this is $\mu[(U \cap V_j) Q']$. Thus $\mu(U) = \mu(UQ')$. Thus $\mu$ is invariant and $\mu_1$ is the conditional invariant distribution. ∎


From the lemma we see that the matrix C has the conditional Haar 
invariant distribution. Since the distribution of С conditional on L is the 
same, C and L are independent. 


Theorem 13.3.3. №С=У’, where Y = (урь. yp? are the normalized char- 
acteristic vectors of A with y,; 2 0 and where А is distributed according tc 
W(I.n), then C has the conditional Haar invariant distribution and C is 
distributed independently of the characteristic roots. 


From the preceding work we can generalize Theorem 13.3.1. 


Theorem 13.3.4. If the symmetric matrix $B$ has a density of the form
$g(l_1, \dots, l_p)$, where $l_1 > \cdots > l_p$ are the characteristic roots of $B$, then the joint
density of the roots is (2) and the matrix of normalized characteristic vectors $Y$
($y_{1i} \ge 0$) is independently distributed according to the conditional Haar invariant
distribution.


Proof. The density of $QBQ'$, where $QQ' = I$, is the same as that of $B$ (for
the roots are invariant), and therefore the distribution of $J(Y'Q')Y'Q'$ is the
same as that of $Y'$. Then Theorem 13.3.4 follows from Lemma 13.3.2. ∎


We shall give an application of this theorem to the case where $B = B'$ is
normally distributed with the (functionally independent) components of $B$
independent with means 0 and variances $\mathcal{E}b_{ii}^2 = 1$ and $\mathcal{E}b_{ij}^2 = \tfrac12$ ($i < j$).

Theorem 13.3.5. Let $B = B'$ have the density

(25)    $\pi^{-p(p+1)/4}\, 2^{-p/2}\, e^{-\frac12 \operatorname{tr} B^2}.$

Then the characteristic roots $l_1 > \cdots > l_p$ of $B$ have the density

(26)    $2^{-p/2}\, \pi^{p(p-1)/4}\, \Gamma_p^{-1}(\tfrac12 p)\, \exp\Bigl(-\tfrac12 \sum_{i=1}^{p} l_i^2\Bigr) \prod_{i<j} (l_i - l_j),$

and the matrix $Y$ of the normalized characteristic vectors ($y_{1i} \ge 0$) is indepen-
dently distributed according to the conditional Haar invariant distribution.

Proof. Since the characteristic roots of $B^2$ are $l_1^2, \dots, l_p^2$ and $\operatorname{tr} B^2 = \sum_{i=1}^p l_i^2$,
the theorem follows directly. ∎
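The key step of the proof, that the density (25) depends on $B$ only through its characteristic roots, can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and the construction $B = (Z + Z')/2$ with $Z$ standard normal is an assumption chosen because it gives diagonal variance 1 and off-diagonal variance 1/2.

```python
import numpy as np

def sample_symmetric_gaussian(p, rng):
    """Symmetric B with N(0, 1) diagonal and N(0, 1/2) off-diagonal elements."""
    Z = rng.standard_normal((p, p))
    return (Z + Z.T) / 2.0

def log_density_25(B):
    """Logarithm of the density (25): pi^{-p(p+1)/4} 2^{-p/2} exp(-tr(B^2)/2)."""
    p = B.shape[0]
    return (-0.25 * p * (p + 1) * np.log(np.pi)
            - 0.5 * p * np.log(2.0)
            - 0.5 * np.trace(B @ B))

B = sample_symmetric_gaussian(4, np.random.default_rng(5))
roots = np.linalg.eigvalsh(B)
trace_identity_gap = np.trace(B @ B) - np.sum(roots ** 2)   # should be ~0
```

For $p = 1$ the expression reduces to the standard normal log density, which is a quick sanity check on the normalizing constant.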


Corollary 13.3.2. Let $nS$ be distributed according to $W(I, n)$, and define the
diagonal matrix $L$ and $C$ by $S = C'LC$, $C'C = I$, $l_1 > \cdots > l_p$, and $c_{i1} \ge 0$,
$i = 1, \dots, p$. Then the density of the limiting distribution of $\sqrt{n}(L - I) = D$
diagonal is (26) with $l_i$ replaced by $d_i$, and the matrix $C$ is independently
distributed according to the conditional Haar measure.

Proof. The density of the limiting distribution of $\sqrt{n}(S - I)$ is (25), and the
diagonal elements of $D$ are the characteristic roots of $\sqrt{n}(S - I)$ and the
columns of $C'$ are the characteristic vectors. ∎


13.4. CANONICAL CORRELATIONS 


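The sample canonical correlations were shown in Section 12.3 to be the
square roots of the roots of

(1)    $|A_{12}A_{22}^{-1}A_{21} - fA_{11}| = 0,$

where

(2)    $A_{ij} = \sum_{\alpha=1}^{N} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(j)} - \bar{x}^{(j)})', \qquad i, j = 1, 2,$

and the distribution of

(3)    $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$

is $N(\mu, \Sigma)$, where

(4)    $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

From Section 3.3 we know that the distribution of $A_{ij}$ is the same as that of

(5)    $A_{ij} = \sum_{\alpha=1}^{n} Y_\alpha^{(i)} Y_\alpha^{(j)\prime}, \qquad i, j = 1, 2,$

where $n = N - 1$ and

(6)    $Y = \begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix}$

is distributed according to $N(0, \Sigma)$. Let us assume that the dimensionality of
$Y^{(1)}$, say $p_1$, is not greater than the dimensionality of $Y^{(2)}$, say $p_2$. Then there
are $p_1$ nonzero roots of (1), say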

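(7)    $f_1 \ge f_2 \ge \cdots \ge f_{p_1}.$

Now we shall find the distribution of $(f_i)$ when

(8)    $\Sigma_{12} = 0.$

For the moment assume $\{y_\alpha^{(2)}\}$ to be fixed. Then $A_{22}$ is fixed, and

(9)    $B = A_{12}A_{22}^{-1}$

is the matrix of regression coefficients of $Y^{(1)}$ on $Y^{(2)}$. From Section 4.3 we
know that

(10)    $A_{11\cdot 2} = \sum_{\alpha=1}^{n} (Y_\alpha^{(1)} - BY_\alpha^{(2)})(Y_\alpha^{(1)} - BY_\alpha^{(2)})' = A_{11} - BA_{22}B' = A_{11} - A_{12}A_{22}^{-1}A_{21}$

and

(11)    $Q = BA_{22}B' = A_{12}A_{22}^{-1}A_{21}$

($\boldsymbol{\beta} = 0$) are independently distributed according to $W(\Sigma_{11}, n - p_2)$ and
$W(\Sigma_{11}, p_2)$, respectively. In terms of $Q$ the equation (1) defining $f$ is
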
(12)    $|Q - f(A_{11\cdot 2} + Q)| = 0.$

The distribution of $f_i$, $i = 1, \dots, p_1$, is the distribution of the nonzero roots of
(12), and the density is given by (see Section 13.2)

(13)    $\dfrac{\pi^{p_1^2/2}\, \Gamma_{p_1}(\tfrac12 n)}{\Gamma_{p_1}[\tfrac12(n - p_2)]\, \Gamma_{p_1}(\tfrac12 p_1)\, \Gamma_{p_1}(\tfrac12 p_2)} \prod_{i=1}^{p_1} \Bigl\{ f_i^{(p_2 - p_1 - 1)/2} (1 - f_i)^{(n - p_2 - p_1 - 1)/2} \Bigr\} \prod_{i<j} (f_i - f_j).$

Since the conditional density (13) does not depend upon $Y^{(2)}$, (13) is the
unconditional density of the squares of the sample canonical correlation
coefficients of the two sets $X_\alpha^{(1)}$ and $X_\alpha^{(2)}$, $\alpha = 1, \dots, N$. The density (13) also
holds when the $X^{(2)}$ are actually fixed variate vectors or have any distribu-
tion, so long as $X^{(1)}$ and $X^{(2)}$ are independently distributed and $X^{(1)}$ has a
multivariate normal distribution.

In the special case when $p_1 = 1$, $p_2 = p - 1$, (13) reduces to

(14)    $\dfrac{\Gamma[\tfrac12(N - 1)]}{\Gamma[\tfrac12(N - p)]\, \Gamma[\tfrac12(p - 1)]}\, f^{\frac12(p-1) - 1} (1 - f)^{\frac12(N - p) - 1},$

which is the density of the square of the sample multiple correlation coeffi-
cient between $X^{(1)}$ ($p_1 = 1$) and $X^{(2)}$ ($p_2 = p - 1$).
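The equivalence of (1) and (12) can be illustrated numerically. The sketch below is not from the text: the function name is hypothetical and the data are simulated with $\Sigma_{12} = 0$. Since $A_{11\cdot 2} + Q = A_{11}$, the roots $f$ of (12) are the characteristic roots of $A_{11}^{-1}Q$, and they are related to the roots $g$ of $|Q - gA_{11\cdot 2}| = 0$ by $f = g/(1+g)$.

```python
import numpy as np

def squared_canonical_correlations(Y1, Y2):
    """Roots f of |A12 A22^{-1} A21 - f A11| = 0 for data matrices Y1 (p1 x n)
    and Y2 (p2 x n) whose columns are mean-zero observations."""
    A11 = Y1 @ Y1.T
    A12 = Y1 @ Y2.T
    A22 = Y2 @ Y2.T
    Q = A12 @ np.linalg.solve(A22, A12.T)            # Q = A12 A22^{-1} A21
    f = np.linalg.eigvals(np.linalg.solve(A11, Q)).real
    return np.sort(f)[::-1], A11, Q

rng = np.random.default_rng(1)
Y1 = rng.standard_normal((2, 50))                    # p1 = 2
Y2 = rng.standard_normal((3, 50))                    # p2 = 3, independent of Y1
f, A11, Q = squared_canonical_correlations(Y1, Y2)

# Roots g of |Q - g A_{11.2}| = 0, with A_{11.2} = A11 - Q.
g = np.sort(np.linalg.eigvals(np.linalg.solve(A11 - Q, Q)).real)[::-1]
```

Here `f` and `g / (1 + g)` agree up to rounding, and every `f` lies strictly between 0 and 1.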


13.5. ASYMPTOTIC DISTRIBUTIONS IN THE CASE OF 
ONE WISHART MATRIX 


13.5.1. All Population Roots Different 


In Section 13.3 we found the density of the diagonal matrix $L$ and the
orthogonal matrix $B$ defined by $S = BLB'$, $l_1 \ge \cdots \ge l_p$, and $b_{1i} \ge 0$, $i =
1, \dots, p$, when $nS$ is distributed according to $W(I, n)$. In this section we find
the asymptotic distribution of $L$ and $B$ when $nS$ is distributed according to
$W(\Sigma, n)$ and the characteristic roots of $\Sigma$ are different. (Corollary 13.3.2
gave the asymptotic distribution when $\Sigma = I$.)


Theorem 13.5.1. Suppose $nS$ has the distribution $W(\Sigma, n)$. Define diagonal
$\Lambda$ and $L$ and orthogonal $\beta$ and $B$ by

(1)    $\Sigma = \beta\Lambda\beta', \qquad S = BLB',$

$\lambda_1 > \lambda_2 > \cdots > \lambda_p$, $l_1 \ge l_2 \ge \cdots \ge l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \dots, p$. Define
$G = \sqrt{n}(B - \beta)$ and diagonal $D = \sqrt{n}(L - \Lambda)$. Then the limiting distribution of
$D$ and $G$ is normal with $D$ and $G$ independent, and the diagonal elements of $D$
are independent. The diagonal element $d_i$ has the limiting distribution $N(0, 2\lambda_i^2)$.
The covariance matrix of $g_i$ in the limiting distribution of $G = (g_1, \dots, g_p)$ is

(2)    $\mathcal{AC}(g_i) = \sum_{k=1,\, k \ne i}^{p} \dfrac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta_k\beta_k',$

where $\beta = (\beta_1, \dots, \beta_p)$. The covariance matrix of $g_i$ and $g_j$ in the limiting
distribution is

(3)    $\mathcal{AC}(g_i, g_j) = -\dfrac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta_j\beta_i', \qquad i \ne j.$


Proof. The matrix $nT = n\beta'S\beta$ is distributed according to $W(\Lambda, n)$. Let

(4)    $T = YLY',$

where $Y$ is orthogonal. In order that (4) determine $Y$ uniquely, we require
$y_{ii} \ge 0$. Let $\sqrt{n}(T - \Lambda) = U$ and $\sqrt{n}(Y - I) = W$. Then (4) can be written

(5)    $\Lambda + \dfrac{1}{\sqrt n}U = \Bigl[I + \dfrac{1}{\sqrt n}W\Bigr]\Bigl[\Lambda + \dfrac{1}{\sqrt n}D\Bigr]\Bigl[I + \dfrac{1}{\sqrt n}W\Bigr]',$

which is equivalent to

(6)    $U = W\Lambda + D + \Lambda W' + \dfrac{1}{\sqrt n}(WD + W\Lambda W' + DW') + \dfrac{1}{n}WDW'.$

From $I = YY' = [I + (1/\sqrt n)W][I + (1/\sqrt n)W]'$, we have

(7)    $0 = W + W' + \dfrac{1}{\sqrt n}WW'.$

We shall proceed heuristically and justify the method later. If we neglect
terms of order $1/\sqrt n$ and $1/n$ in (6) and (7), we obtain

(8)    $U = W\Lambda + D + \Lambda W',$
(9)    $0 = W + W'.$

When we substitute $W' = -W$ from (9) into (8) and write the result in
components, we obtain $w_{ii} = 0$,

(10)    $d_i = u_{ii}, \qquad i = 1, \dots, p,$
(11)    $w_{ij} = \dfrac{u_{ij}}{\lambda_j - \lambda_i}, \qquad i \ne j, \quad i, j = 1, \dots, p.$

(Note $w_{ij} = -w_{ji}$.) From Theorem 3.4.4 we know that in the limiting normal
distribution of $U$ the functionally independent elements are statistically
independent with means 0 and variances $\mathcal{AV}(u_{ii}) = 2\lambda_i^2$ and $\mathcal{AV}(u_{ij}) =
\lambda_i\lambda_j$, $i \ne j$. Then the limiting distribution of $D$ and $W$ is normal, and
$d_1, \dots, d_p, w_{12}, w_{13}, \dots, w_{p-1,p}$ are independent with means 0 and variances
$\mathcal{AV}(d_i) = 2\lambda_i^2$, $i = 1, \dots, p$, and $\mathcal{AV}(w_{ij}) = \lambda_i\lambda_j/(\lambda_j - \lambda_i)^2$, $j = i + 1, \dots, p$,
$i = 1, \dots, p - 1$. Each column of $B$ is $\pm$ the corresponding column of $\beta Y$;
since $Y \xrightarrow{p} I$, we have $\beta Y \xrightarrow{p} \beta$, and with arbitrarily high probability each
column of $B$ is nearly identical to the corresponding column of $\beta Y$. Then
$G = \sqrt n(B - \beta)$ has the limiting distribution of $\beta\sqrt n(Y - I) = \beta W$. The
asymptotic variances and covariances follow.

Now we justify the limiting distribution of $D$ and $W$. The equations
$T = YLY'$ and $I = YY'$ and conditions $l_1 > \cdots > l_p$, $y_{ii} > 0$, $i = 1, \dots, p$, define
a 1-1 transformation of $T$ to $Y, L$ except for a set of measure 0. The
transformation from $Y, L$ to $T$ is continuously differentiable. The inverse is
continuously differentiable in a neighborhood of $Y = I$ and $L = \Lambda$, since the
equations (8) and (9) can be solved uniquely. Hence $Y, L$ as a function of $T$
satisfies the conditions of Theorem 4.2.3. ∎
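The heuristic step from (6) and (7) to (10) and (11) can be checked numerically for a fixed symmetric $U$: for large $n$ the exact quantities $D = \sqrt n(L - \Lambda)$ and $W = \sqrt n(Y - I)$ computed from $T = \Lambda + U/\sqrt n$ should be close to $u_{ii}$ and $u_{ij}/(\lambda_j - \lambda_i)$. The following sketch is illustrative only; the function names and the particular $\Lambda$, $U$, and $n$ are assumptions.

```python
import numpy as np

def roots_and_vectors_expansion(lams, U, n):
    """Exact D = sqrt(n)(L - Lambda) and W = sqrt(n)(Y - I) for
    T = Lambda + U / sqrt(n) = Y L Y', with roots descending and y_ii > 0."""
    lams = np.asarray(lams, dtype=float)
    p = lams.size
    T = np.diag(lams) + U / np.sqrt(n)
    vals, vecs = np.linalg.eigh(T)
    order = np.argsort(vals)[::-1]                 # descending roots
    vals, vecs = vals[order], vecs[:, order]
    vecs = vecs * np.sign(np.diag(vecs))           # make each y_ii positive
    return np.sqrt(n) * (vals - lams), np.sqrt(n) * (vecs - np.eye(p))

def first_order_prediction(lams, U):
    """First-order values from (10) and (11): d_i = u_ii and
    w_ij = u_ij / (lambda_j - lambda_i) for i != j, w_ii = 0."""
    lams = np.asarray(lams, dtype=float)
    gap = lams[None, :] - lams[:, None]            # gap[i, j] = lambda_j - lambda_i
    np.fill_diagonal(gap, 1.0)                     # placeholder, diagonal reset below
    W = U / gap
    np.fill_diagonal(W, 0.0)
    return np.diag(U).copy(), W

lams = [5.0, 3.0, 1.0]                             # distinct roots, descending
U = np.array([[1.0, 0.5, -0.3],
              [0.5, -2.0, 0.8],
              [-0.3, 0.8, 0.4]])
D, W = roots_and_vectors_expansion(lams, U, 1e10)
D0, W0 = first_order_prediction(lams, U)
```

The differences `D - D0` and `W - W0` are of order $1/\sqrt n$, and `W0` is skew-symmetric, in line with (9).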


13.5.2. One Root of Higher Multiplicity 


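In Section 11.7.3 we used the asymptotic distribution of the $q$ smallest
sample roots when the $q$ smallest population roots are equal. We shall now
derive that distribution. Let

(12)    $\Lambda = \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^* I_q \end{pmatrix},$

where the diagonal elements of the diagonal matrix $\Lambda_1$ are different and are
larger than $\lambda^*$ ($> 0$). Let

(13)    $T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}, \qquad L = \begin{pmatrix} L_1 & 0 \\ 0 & L_2 \end{pmatrix}.$

Then $T \xrightarrow{p} \Lambda$, which implies $L \xrightarrow{p} \Lambda$, $Y_{11} \xrightarrow{p} I$, $Y_{12} \xrightarrow{p} 0$, $Y_{21} \xrightarrow{p} 0$, but $Y_{22}$ does
not have a probability limit. However, $Y_{22}Y_{22}' \xrightarrow{p} I_q$. Let the singular value
decomposition of $Y_{22}$ be $EJF$, where $J$ is diagonal and $E$ and $F$ are
orthogonal. Define $C_2 = EF$, which is orthogonal. Let $U = \sqrt n(T - \Lambda)$ and
$D = \sqrt n(L - \Lambda)$ be partitioned similarly to $T$ and $L$. Define $W_{11} =
\sqrt n(Y_{11} - I)$, $W_{12} = \sqrt n\,Y_{12}$, $W_{21} = \sqrt n\,Y_{21}$, and $W_{22} = \sqrt n(Y_{22} - C_2) = \sqrt n\,E(J -
I_q)F$. Then (4) can be written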


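(14)    $\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^* I_q \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix}$

        $= \Bigl[\begin{pmatrix} I & 0 \\ 0 & C_2 \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix}\Bigr] \Bigl[\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^* I_q \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix}\Bigr] \Bigl[\begin{pmatrix} I & 0 \\ 0 & C_2' \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} W_{11}' & W_{21}' \\ W_{12}' & W_{22}' \end{pmatrix}\Bigr]$

        $= \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^* I_q \end{pmatrix} + \dfrac{1}{\sqrt n}\Bigl[\begin{pmatrix} D_1 & 0 \\ 0 & C_2D_2C_2' \end{pmatrix} + \begin{pmatrix} W_{11}\Lambda_1 & \lambda^* W_{12}C_2' \\ W_{21}\Lambda_1 & \lambda^* W_{22}C_2' \end{pmatrix} + \begin{pmatrix} \Lambda_1W_{11}' & \Lambda_1W_{21}' \\ \lambda^* C_2W_{12}' & \lambda^* C_2W_{22}' \end{pmatrix}\Bigr] + \dfrac{1}{n}M,$

where the submatrices of $M$ are sums of products of $C_2$, $\Lambda_1$, $\lambda^* I_q$, $D_k$, $W_{kl}$,
and $1/\sqrt n$. The orthogonality of $Y$ ($I_p = YY'$) implies

(15)    $I = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} + \dfrac{1}{\sqrt n}\Bigl[\begin{pmatrix} W_{11} & W_{12}C_2' \\ W_{21} & W_{22}C_2' \end{pmatrix} + \begin{pmatrix} W_{11}' & W_{21}' \\ C_2W_{12}' & C_2W_{22}' \end{pmatrix}\Bigr] + \dfrac{1}{n}N,$

where the submatrices of $N$ are sums of products of $W_{kl}$. From (14) and (15)
we find that

(16)    $U_{22} = C_2D_2C_2' + O_p(1/\sqrt n).$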


The limiting distribution of $(1/\lambda^*)U_{22}$ has the density (25) of Section 13.3
with $p$ replaced by $q$. Then the limiting distribution of $D_2$ and $C_2$ is the
distribution of $D_2^*$ and $Y_2^*$ defined by $U_{22}^* = Y_2^*D_2^*Y_2^{*\prime}$, where $(1/\lambda^*)U_{22}^*$ has
the density (25) of Section 13.3.

Theorem 13.5.2. Under the conditions of Theorem 13.5.1 and $\Lambda =
\operatorname{diag}(\Lambda_1, \lambda^* I_q)$, the density of the limiting distribution of $d_{p-q+1}, \dots, d_p$ is


EP - -4 1 P 2 
(17) 2 (a*r) DAT: рео — zi X |n n). 
i-p-q*1l i<j 


To justify the preceding derivation we note that $D_2$ and $Y_{22}$ are functions
of $U$ depending on $n$ that converge to the solution of $U_{22}^* = Y_2^*D_2^*Y_2^{*\prime}$. We
can use the following theorem given by Anderson (1963a) and due to Rubin.

Theorem 13.5.3. Let $F_n(u)$ be the cumulative distribution function of a
random matrix $U_n$. Let $V_n$ be a matrix-valued function of $U_n$, $V_n = f_n(U_n)$, and
let $G_n(v)$ be the (induced) distribution of $V_n$. Suppose $F_n(u) \to F(u)$ in every
continuity point of $F(u)$, and suppose for every continuity point $u$ of $f(u)$,
$f_n(u_n) \to f(u)$ when $u_n \to u$. Let $G(v)$ be the distribution of the random matrix
$V = f(U)$, where $U$ has the distribution $F(u)$. If the probability of the set of
discontinuities of $f(u)$ according to $F(u)$ is 0, then

(18)    $\lim_{n \to \infty} G_n(v) = G(v)$

in every continuity point of $G(v)$.

The details of verifying that $U(n)$ and

(19)    $(D_2(n), Y_{22}(n)) = f_n(U(n))$

satisfy the conditions of the theorem have been given by Anderson (1963a).


13.6. ASYMPTOTIC DISTRIBUTIONS IN THE CASE OF 
TWO WISHART MATRICES 


13.6.1. All Population Roots Different 

In Section 13.2 we studied the distributions of the roots $l_1 \ge l_2 \ge \cdots \ge l_p$ of

(1)    $|S^* - lT^*| = 0$

and the vectors satisfying

(2)    $(S^* - lT^*)x^* = 0$

and $x^{*\prime}T^*x^* = 1$ when $A^* = mS^*$ and $B^* = nT^*$ are distributed indepen-
dently according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. In this section we
study the asymptotic distributions of the roots and vectors as $n \to \infty$ when $A^*$
and $B^*$ are distributed independently according to $W(\Phi, m)$ and $W(\Sigma, n)$,
respectively, and $m/n \to \eta > 0$. We shall assume that the roots of

(3)    $|\Phi - \lambda\Sigma| = 0$

are distinct. (In Section 13.2 $\lambda_1 = \cdots = \lambda_p = 1$.)
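The roots and vectors of (1) and (2), with the normalization $x^{*\prime}T^*x^* = 1$, can be computed by reducing the problem to an ordinary symmetric one through a Cholesky factor of $T^*$. The sketch below is a minimal illustration; the function name and the simulated matrices are assumptions, not part of the text.

```python
import numpy as np

def roots_in_metric(S, T):
    """Roots l_1 >= ... >= l_p of |S - l T| = 0 and vectors X with
    S X = T X diag(l), X' T X = I, and nonnegative first row."""
    R = np.linalg.cholesky(T)                  # T = R R', R lower triangular
    Rinv = np.linalg.inv(R)
    M = Rinv @ S @ Rinv.T                      # same roots as |S - l T| = 0
    M = (M + M.T) / 2.0                        # remove rounding asymmetry
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    l = vals[order]
    X = Rinv.T @ vecs[:, order]                # x = (R')^{-1} v, so x' T x = 1
    X = X * np.where(X[0, :] < 0, -1.0, 1.0)   # sign convention x_{1i} >= 0
    return l, X

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 40))
B = rng.standard_normal((3, 60))
S_star, T_star = A @ A.T / 40, B @ B.T / 60    # simulated S* and T*
l, X = roots_in_metric(S_star, T_star)
```

The output satisfies $X^{*\prime}T^*X^* = I$ and $X^{*\prime}S^*X^* = L$, the relations that are used later in the proof of Theorem 13.6.1.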


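Theorem 13.6.1. Let $mS^*$ and $nT^*$ be independently distributed according
to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_p$ ($> 0$) be the
roots of (3), and let $\Lambda$ be the diagonal matrix with the roots as diagonal elements
in descending order; let $\gamma_1, \dots, \gamma_p$ be the solutions to

(4)    $(\Phi - \lambda_i\Sigma)\gamma = 0, \qquad i = 1, \dots, p,$

$\gamma'\Sigma\gamma = 1$, and $\gamma_{1i} \ge 0$, and let $\Gamma = (\gamma_1, \dots, \gamma_p)$. Let $l_1 \ge \cdots \ge l_p$ ($> 0$) be the
roots of (1), and let $L$ be the diagonal matrix with the roots as diagonal elements
in descending order; let $x_1^*, \dots, x_p^*$ be the solutions to (2) for $l = l_i$, $i = 1, \dots, p$,
$x^{*\prime}T^*x^* = 1$, and $x_{1i}^* > 0$, and let $X^* = (x_1^*, \dots, x_p^*)$. Define $Z^* = \sqrt n(X^* - \Gamma)$
and diagonal $D = \sqrt n(L - \Lambda)$. Then the limiting distribution of $D$ and $Z^*$ is
normal with means 0 as $n \to \infty$, $m \to \infty$, and $m/n \to \eta$ ($> 0$). The asymptotic
variances and covariances that are not 0 are

(5)    $\mathcal{AV}(d_i) = 2\lambda_i^2\,\dfrac{1 + \eta}{\eta},$

(6)    $\mathcal{AC}(z_i^*) = \sum_{k=1,\, k \ne i}^{p} \dfrac{\lambda_i(\lambda_k + \eta\lambda_i)}{\eta(\lambda_k - \lambda_i)^2}\, \gamma_k\gamma_k' + \tfrac12\gamma_i\gamma_i',$

(7)    $\mathcal{AC}(d_i, z_i^*) = \lambda_i\gamma_i,$

(8)    $\mathcal{AC}(z_i^*, z_j^*) = -\dfrac{\lambda_i\lambda_j(1 + \eta)}{\eta(\lambda_i - \lambda_j)^2}\, \gamma_j\gamma_i', \qquad i \ne j.$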
Proof. Let

(9)    $S = \Gamma'S^*\Gamma, \qquad T = \Gamma'T^*\Gamma.$

Then $mS$ and $nT$ are distributed independently according to $W(\Lambda, m)$ and
$W(I, n)$, respectively (Section 7.3.3). Then $l_1, \dots, l_p$ are the roots of

(10)    $|S - lT| = 0.$

Let $x_1, \dots, x_p$ be the solutions to

(11)    $(S - l_iT)x = 0, \qquad i = 1, \dots, p,$

and $x'Tx = 1$, and let $X = (x_1, \dots, x_p)$. Then $x_i^* = \Gamma x_i$ and $X^* = \Gamma X$ except
possibly for multiplication of columns of $X$ (or $X^*$) by $-1$. If $Z = \sqrt n(X - I)$,
then $Z^* = \Gamma Z$ (except possibly for multiplication of columns by $-1$).

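We shall now find the limiting distribution of $D$ and $Z$. Let $\sqrt n(S - \Lambda) = U$
and $\sqrt n(T - I) = V$. Then $U$ and $V$ have independent limiting normal distri-
butions with means 0. The functionally independent elements of $U$ and $V$ are
statistically independent in the limiting distribution. The variances are $\mathcal{E}u_{ii}^2
= (n/m)2\lambda_i^2 \to 2\lambda_i^2/\eta$; $\mathcal{E}u_{ij}^2 = (n/m)\lambda_i\lambda_j \to \lambda_i\lambda_j/\eta$, $i \ne j$; $\mathcal{E}v_{ii}^2 = 2$; $\mathcal{E}v_{ij}^2 = 1$,
$i \ne j$.

From the definition of $L$ and $X$ we have $SX = TXL$, $X'TX = I$, and
$X'SX = L$. If we let $X^{-1} = G$, we obtain

(12)    $S = G'LG, \qquad T = G'G.$

We require $g_{ii} > 0$, $i = 1, \dots, p$. Since $S \xrightarrow{p} \Lambda$ and $T \xrightarrow{p} I$, we have $L \xrightarrow{p} \Lambda$ and
$G \xrightarrow{p} I$. Let $\sqrt n(G - I) = H$. Then we write (12) as

(13)    $\Lambda + \dfrac{1}{\sqrt n}U = \Bigl[I + \dfrac{1}{\sqrt n}H'\Bigr]\Bigl[\Lambda + \dfrac{1}{\sqrt n}D\Bigr]\Bigl[I + \dfrac{1}{\sqrt n}H\Bigr],$

(14)    $I + \dfrac{1}{\sqrt n}V = \Bigl[I + \dfrac{1}{\sqrt n}H'\Bigr]\Bigl[I + \dfrac{1}{\sqrt n}H\Bigr].$

These can be rewritten

(15)    $U = D + \Lambda H + H'\Lambda + \dfrac{1}{\sqrt n}(DH + H'D + H'\Lambda H) + \dfrac{1}{n}H'DH,$

(16)    $V = H + H' + \dfrac{1}{\sqrt n}H'H.$

If we neglect the terms of order $1/\sqrt n$ and $1/n$ (as in Section 13.5), we
can write

(17)    $U = D + \Lambda H + H'\Lambda,$
(18)    $V = H + H',$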
(19)    $U - V\Lambda = D + \Lambda H - H\Lambda.$

The diagonal elements of (18) and the components of (19) are

(20)    $v_{ii} = 2h_{ii},$
(21)    $u_{ii} - \lambda_iv_{ii} = d_i,$
(22)    $u_{ij} - v_{ij}\lambda_j = (\lambda_i - \lambda_j)h_{ij}, \qquad i \ne j.$

The limiting distribution of $H$ and $D$ is normal with means 0. The pairs
$(h_{ij}, h_{ji})$ of off-diagonal elements of $H$ are independent with variances

(23)    $\mathcal{AV}(h_{ij}) = \dfrac{\lambda_j(\lambda_i + \eta\lambda_j)}{\eta(\lambda_i - \lambda_j)^2}, \qquad i \ne j,$

and covariances

(24)    $\mathcal{AC}(h_{ij}, h_{ji}) = -\dfrac{\lambda_i\lambda_j(1 + \eta)}{\eta(\lambda_i - \lambda_j)^2}, \qquad i \ne j.$

The pairs $(d_i, h_{ii})$ of diagonal elements of $D$ and $H$ are independent with
variances (5),

(25)    $\mathcal{AV}(h_{ii}) = \tfrac12,$

and covariance

(26)    $\mathcal{AC}(d_i, h_{ii}) = -\lambda_i.$

The diagonal elements of $D$ and $H$ are independent of the off-diagonal
elements of $H$.

That the limiting distribution of $D$ and $H$ is normal is justified by
Theorem 4.2.3. $S$ and $T$ are polynomials in $L$ and $G$, and their derivatives
are polynomials and hence continuous. Since the equations (12) with auxiliary
conditions can be solved uniquely for $L$ and $G$, the inverse function is also
continuously differentiable at $L = \Lambda$ and $G = I$. By Theorem 4.2.3,
$D = \sqrt n(L - \Lambda)$ and $H = \sqrt n(G - I)$ have a limiting normal distribution. In
turn, $X = G^{-1}$ is continuously differentiable at $G = I$, and $Z = \sqrt n(X - I)
= \sqrt n(G^{-1} - I)$ has the limiting distribution of $-H$. (Expand $\sqrt n\{[I +
(1/\sqrt n)H]^{-1} - I\}$.) Since $G \xrightarrow{p} I$, $X \xrightarrow{p} I$, and $x_{ii} > 0$, $i = 1, \dots, p$, with proba-
bility approaching 1. Then $Z^* = \sqrt n(X^* - \Gamma)$ has the limiting distribution of
$\Gamma Z$. (Since $X \xrightarrow{p} I$, we have $X^* = \Gamma X \xrightarrow{p} \Gamma$ and $x_{1i}^* > 0$, $i = 1, \dots, p$, with
probability approaching 1.) The asymptotic variances and covariances (6) to
(8) are obtained from (23) to (26). ∎

Anderson (1989b) has derived the limiting distribution of the characteris-
tic roots and vectors of one sample covariance matrix in the metric of another
with population roots of arbitrary multiplicities.


13.6.2. One Root of Higher Multiplicity 


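In Section 13.6.1 it was assumed that $mS^*$ and $nT^*$ were distributed
independently according to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively, and that the
roots of $|\Phi - \lambda\Sigma| = 0$ were distinct. In this section we assume that the $k$
larger roots are distinct and greater than the $p - k$ smaller roots, which are
assumed equal. Let the diagonal matrix $\Lambda$ of characteristic roots be $\Lambda =
\operatorname{diag}(\Lambda_1, \lambda^* I_{p-k})$, and let $\Gamma$ be a matrix satisfying

(27)    $\Phi\Gamma = \Sigma\Gamma\Lambda, \qquad \Gamma'\Sigma\Gamma = I.$

Define $S$ and $T$ by (9) and diagonal $L$ and $G$ by (12). Then $S \xrightarrow{p} \Lambda$, $T \xrightarrow{p} I$,
and $L \xrightarrow{p} \Lambda$. Partition $S$, $T$, $L$, and $G$ as

(28)    $S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}, \quad T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}, \quad L = \begin{pmatrix} L_1 & 0 \\ 0 & L_2 \end{pmatrix}, \quad G = \begin{pmatrix} G_{11} & G_{12} \\ G_{21} & G_{22} \end{pmatrix},$

where $S_{11}$, $T_{11}$, $L_1$, and $G_{11}$ are $k \times k$. Then $G_{11} \xrightarrow{p} I$, $G_{12} \xrightarrow{p} 0$, and $G_{21} \xrightarrow{p} 0$,
but $G_{22}$ does not have a probability limit. Instead $G_{22}'G_{22} \xrightarrow{p} I_{p-k}$. Let the
singular value decomposition of $G_{22}$ be $EJF$, where $E$ and $F$ are orthogonal
and $J$ is diagonal. Let $C_2 = EF$.

The limiting distribution of $U = \sqrt n(S - \Lambda)$ and $V = \sqrt n(T - I)$ is normal
with the covariance structure given above (12) with $\lambda_{k+1} = \cdots = \lambda_p = \lambda^*$.
Define $D = \sqrt n(L - \Lambda)$, $H_{11} = \sqrt n(G_{11} - I)$, $H_{12} = \sqrt n\,G_{12}$, $H_{21} = \sqrt n\,G_{21}$, and
$H_{22} = \sqrt n(G_{22} - C_2) = \sqrt n\,E(J - I_{p-k})F$. Then (13) and (15) are replaced by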


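(29)    $\begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^* I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt n}\begin{pmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{pmatrix}$

        $= \begin{pmatrix} I_k + \frac{1}{\sqrt n}H_{11}' & \frac{1}{\sqrt n}H_{21}' \\ \frac{1}{\sqrt n}H_{12}' & C_2' + \frac{1}{\sqrt n}H_{22}' \end{pmatrix} \begin{pmatrix} \Lambda_1 + \frac{1}{\sqrt n}D_1 & 0 \\ 0 & \lambda^* I_{p-k} + \frac{1}{\sqrt n}D_2 \end{pmatrix} \begin{pmatrix} I_k + \frac{1}{\sqrt n}H_{11} & \frac{1}{\sqrt n}H_{12} \\ \frac{1}{\sqrt n}H_{21} & C_2 + \frac{1}{\sqrt n}H_{22} \end{pmatrix}$

        $= \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \lambda^* I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt n}\Bigl[\begin{pmatrix} D_1 & 0 \\ 0 & C_2'D_2C_2 \end{pmatrix} + \begin{pmatrix} \Lambda_1H_{11} & \Lambda_1H_{12} \\ \lambda^* C_2'H_{21} & \lambda^* C_2'H_{22} \end{pmatrix} + \begin{pmatrix} H_{11}'\Lambda_1 & \lambda^* H_{21}'C_2 \\ H_{12}'\Lambda_1 & \lambda^* H_{22}'C_2 \end{pmatrix}\Bigr] + O_p(1/n),$

and (14) and (16) are replaced by

(30)    $I + \dfrac{1}{\sqrt n}\begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} = I + \dfrac{1}{\sqrt n}\Bigl[\begin{pmatrix} H_{11} & H_{12} \\ C_2'H_{21} & C_2'H_{22} \end{pmatrix} + \begin{pmatrix} H_{11}' & H_{21}'C_2 \\ H_{12}' & H_{22}'C_2 \end{pmatrix}\Bigr] + O_p(1/n).$

If we neglect the terms of order $1/\sqrt n$ and $1/n$, instead of (19) we can write

(31)    $\begin{pmatrix} U_{11} - V_{11}\Lambda_1 & U_{12} - \lambda^* V_{12} \\ U_{21} - V_{21}\Lambda_1 & U_{22} - \lambda^* V_{22} \end{pmatrix} = \begin{pmatrix} D_1 + \Lambda_1H_{11} - H_{11}\Lambda_1 & (\Lambda_1 - \lambda^* I)H_{12} \\ C_2'H_{21}(\lambda^* I - \Lambda_1) & C_2'D_2C_2 \end{pmatrix}.$

Then $v_{ii} = 2h_{ii}$, $i = 1, \dots, k$; $u_{ii} - \lambda_iv_{ii} = d_i$, $i = 1, \dots, k$; $u_{ij} - v_{ij}\lambda_j =
(\lambda_i - \lambda_j)h_{ij}$, $i \ne j$, $i, j = 1, \dots, k$; $U_{22} - \lambda^* V_{22} = C_2'D_2C_2$; $C_2(U_{21} - V_{21}\Lambda_1) =
H_{21}(\lambda^* I - \Lambda_1)$; and $U_{12} - \lambda^* V_{12} = (\Lambda_1 - \lambda^* I)H_{12}$. The limiting distri-
bution of $U_{22} - \lambda^* V_{22}$ is normal with mean 0; $\mathcal{AV}(u_{ii} - \lambda^* v_{ii}) =
2\lambda^{*2}(1 + \eta)/\eta$, $i = k + 1, \dots, p$; and $\mathcal{AV}(u_{ij} - \lambda^* v_{ij}) = \lambda^{*2}(1 + \eta)/\eta$, $i \ne j$,
$i, j = k + 1, \dots, p$. The limiting distribution of $D_2$ and $C_2$ is the distribution
of $D_2$ and $C_2$ defined by $U_{22} - \lambda^* V_{22} = C_2'D_2C_2$, where $(1/\lambda^*)(U_{22} - \lambda^* V_{22})$ has
the density of (25) of Section 13.3.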



13.7. ASYMPTOTIC DISTRIBUTION IN A REGRESSION MODEL 


13.7.1. Both Sets of Variates Stochastic 


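The sample canonical correlations $l_1, \dots, l_{p_1}$ and vectors $\hat\alpha_1, \dots, \hat\alpha_{p_1}$ and
$\hat\gamma_1, \dots, \hat\gamma_{p_2}$ are defined in Section 12.3. The set $\hat\gamma_1, \dots, \hat\gamma_{p_2}$ and $l_1, \dots, l_{p_1}$ are
defined by

(1)    $S_{21}S_{11}^{-1}S_{12}\hat\gamma = l^2S_{22}\hat\gamma, \qquad \hat\gamma'S_{22}\hat\gamma = 1.$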


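The asymptotic distribution of these quantities was given by Anderson
(1999a) when $X = (X^{(1)\prime}, X^{(2)\prime})'$ has a normal distribution and also when $X^{(1)}$
is normally distributed with a linear function of nonstochastic $X^{(2)}$ as
expected value. We shall now find the asymptotic distribution when $X$ has a
normal distribution. The model in regression form is

(2)    $X^{(1)} = \mathbf{B}X^{(2)} + Z,$

where $X^{(2)}$ and $Z$ are independently normally distributed with expected
values $\mathcal{E}X^{(2)} = 0$ and $\mathcal{E}Z = 0$ and covariances $\mathcal{E}X^{(2)}X^{(2)\prime} = \Sigma_{22}$, $\mathcal{E}ZZ' = \Sigma_{ZZ}$
($\mathcal{E}X^{(2)}Z' = 0$). Then $\mathcal{E}X^{(1)} = 0$ and $\mathcal{E}X^{(1)}X^{(1)\prime} = \Sigma_{11} = \Sigma_{ZZ} + \mathbf{B}\Sigma_{22}\mathbf{B}'$ and
$\mathcal{E}X^{(1)}X^{(2)\prime} = \mathbf{B}\Sigma_{22}$. Inference is based on a sample of $X$ of $n$ observations.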

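First we transform to canonical variables $U = A'X^{(1)}$, $V = \Gamma'X^{(2)}$, and
$W = A'Z$. Then (2) is transformed to

(3)    $U = \Theta V + W,$

where $\Theta = A'\mathbf{B}(\Gamma')^{-1}$, $\mathcal{E}UU' = \Sigma_{UU} = I_{p_1}$, $\mathcal{E}VV' = \Sigma_{VV} = I_{p_2}$, $\mathcal{E}UV' = \Sigma_{UV}
= (\Lambda, 0) = \tilde\Lambda$, $\mathcal{E}WW' = \Sigma_{WW} = I_{p_1} - \Lambda^2$, and $\mathcal{E}VW' = 0$. [See (33) to (37)
and (45) of Section 12.2.] Let the sample covariance matrices be $S_{UU} =
A'S_{11}A$, $S_{UV} = A'S_{12}\Gamma$, and $S_{VV} = \Gamma'S_{22}\Gamma$. Let the sample vectors consti-
tute $H = \Gamma^{-1}\hat\Gamma = \Gamma^{-1}(\hat\gamma_1, \dots, \hat\gamma_{p_2})$. Then $H$ satisfies

(4)    $S_{VU}S_{UU}^{-1}S_{UV}H = S_{VV}H(\hat\Lambda^*)^2, \qquad H'S_{VV}H = I,$

where $\hat\Lambda^* = \operatorname{diag}(l_1, \dots, l_{p_1}, 0, \dots, 0)$; if $p_1 < p_2$, there are $p_2 - p_1$ 0's in $\hat\Lambda^*$.
We have $S_{UU} \xrightarrow{p} I_{p_1}$, $S_{VV} \xrightarrow{p} I_{p_2}$, $S_{UV} \xrightarrow{p} \tilde\Lambda$. Then $\hat\Lambda = \operatorname{diag}(l_1, \dots, l_{p_1}) \xrightarrow{p} \Lambda$.
Let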


(5)    $H = (H_1, H_2) = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix},$

where $H_{11}$ is $p_1 \times p_1$ and $H_{22}$ is $(p_2 - p_1) \times (p_2 - p_1)$. The first $p_1$ columns
of (4) are

(6)    $S_{VU}S_{UU}^{-1}S_{UV}H_1 = S_{VV}H_1\hat\Lambda^2;$

the last $p_2 - p_1$ columns of (4) are $S_{UV}H_2 = 0$. Then $H_{11} \xrightarrow{p} I_{p_1}$, $H_{12} \xrightarrow{p} 0$, and
$H_{21} \xrightarrow{p} 0$, but the probability limit of (4) only implies $H_{22}'H_{22} \xrightarrow{p} I_{p_2 - p_1}$. Let
the singular value decomposition of $H_{22}$ be $H_{22} = EJF$.
Hai: Siu = Yn (буи —1,), Sá, — Vn Gy, — 1,), Shy = Vn (у A), 
Ў = Vn (H, - 1, у), and A* = [Vn (Â — A),0], wh - | 
expansion of (6) yields 0h where foo (ps ini 


27рі' 


o redes] sa) [osc] ie La 
= (n, + 8 (доз XH [a+ ea} 


From (7) we obtain 


ж А A! A A ATA 
(8) ЗА, АЛІ, у A'Siy Ip, + A’ AH? 
= буи, А? + HFA? + 21,5 № +0, (1). 
From HiS, M, =I, we derive 
(9) OA + Hy = -S$y +0,(1). 


In terms of partitioned matrices (8) is 


(10) 


Shy A — ASIA + ASZ) — SEA? 
Shu А -Spy A? 


2A A* + Ht А? AHY 
НА А? 


+0,(1). 


The lower submatrix equation [$(p_2 - p_1) \times p_1$] of (10) is

(11)    $H_{21}^*\Lambda = S_{VU}^{*21} - S_{VV}^{*21}\Lambda + o_p(1) = S_{VW}^{*21} + o_p(1) = (S_{WV}^{*12})' + o_p(1).$

A diagonal element of the upper submatrix equation of (10) is

(12)    $l_i^* = \dfrac{1}{\sqrt n}\sum_{\alpha=1}^{n}\bigl[(u_{i\alpha}v_{i\alpha} - \lambda_i) - \tfrac12\lambda_i(u_{i\alpha}^2 - 1) - \tfrac12\lambda_i(v_{i\alpha}^2 - 1)\bigr] + o_p(1).$

The right-hand side of (12) is the expansion of the sample correlation
coefficient of $u_{i\alpha}$ and $v_{i\alpha}$. See Section 4.2.3. The limiting distribution of $l_i^*$ is
$N[0, (1 - \lambda_i^2)^2]$.

The (i, j)th component of Hj, in (10) is 


(13) 
1 n 
(Aj - AF) Ae = Vn У (Аш и + ÀU; U; — МА jU jal ja Аа) +0,(1). 


Ја ja ^а ја J 
n а=1 


i*j i,j—l....py 


The asymptotic covariance of (А? — A7)h¥, and (А? — АРИЯ is 


и ГОО 2) aono 0)0 +) 
09) ü-2)0-3)93eX3) Q-A3)(*X -2NX)* 


The pair (АЎ, 2) is uncorrelated with other pairs. 
Suppose р; =р,. Then ЃЁ* = ГНА =ГН*. Let Г -(y,.. Yp) P= 
(31,..., Pp). Then 97 = LP y; т, where h*, i +j, is obtained from (13) and 


hž from (9). We obtain И у 


, Pr AHA AZAT 
(15) né($,7 v)(9- = у + 0 7 X) UNO 


& (9-м) 
1- A3)(1—- 2) (А+ А 
(16) паба). Im a Dy, jel. 
А j ^ 


Anderson (19992), has also given the asymptotic covariances of &; and of ¥, 
and &,. Note that Aj, depends linearly on (uj, Vja) and that the pairs 
(Uia Uia) and (Ujas Uja), іж, are uncorrelated. The covariances (14) do not 
depend on (U,V) being normal. 

Now suppose that the rank of $\Sigma_{12}$ is $k < p_1$. Define $H_1 = (H_{11}', H_{21}')'$ as
the first $k$ columns of $H$ satisfying (4), and define $\hat\Lambda_1 = \operatorname{diag}(l_1, \dots, l_k)$.
Then $H_1$ satisfies (6), and $H_1^*$ satisfies (8), (9), (10), and (11). Then $l_i^*$ is
given by (12) for $i = 1, \dots, k$.

The last $p_2 - k$ columns of (4) are


12 LA? HE 
un [sie - | PIE 
Hence 
(18) A H$ = -S5y Co * o, (1) = — $5? С, +0,(1). 


13.7.2. One Set of Variates Stochastic and the Other Set Nonstochastic


Now consider the case that ХО) in (2) is nonstochastic, where 4Z, = 0 and 
&Z,Z', = У зз. We observe X —x,,..., x,. We assume 

n 
(19) 55 = Y xx Xs, 
а=1 


and X., is nonsingular. Then 


Define A, о, and y by solutions to 


, -A(Xz;*BS5B' BS» || а -0 
(22) SzP Md | у 
(23) «'(X,; + В5В’)а=1, Y'SoYy- 1. 


We shall first assume p, = р, and А, > * > Àp, > 0. Then (22) and (23) 
and а, 0 define 
(24) diag(A,,..., A) = A; («,..., м) = A, (Yie Yu) = Г». 


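Let $U_\alpha = A'X_\alpha^{(1)}$, $v_\alpha = \Gamma'x_\alpha^{(2)}$, $W_\alpha = A'Z_\alpha$, $\Theta = A'\mathbf{B}(\Gamma')^{-1} = \Lambda$, and
$H = \Gamma^{-1}\hat\Gamma$. Then $H$ and $\hat\Lambda$ satisfy (4). Then $S_{VV} = I$,

(25)    $S_{UV} = \Theta S_{VV} + S_{WV} = \Theta + S_{WV} \xrightarrow{p} \Theta,$

(26)    $S_{UU} = \Theta S_{VV}\Theta' + \Theta S_{VW} + S_{WV}\Theta' + S_{WW} \xrightarrow{p} I.$

Then (4) can be written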


27) (A+Syy)(A2 + ASyg  Syp A + Syy) (A Syy)H — НА. 
у VW и 


Note that Spy 5 0, Syy 5 1— А?, and hence H SI, ASA. 
Let 5% и = Yn Syw, S$w = Vn [Syw — (4 — А?)]. Then (27) leads to 


28 I— AStA AS (1 А2) - А5 A+ АН" 
VW wv 
= H*A? +2A A* +0,(1). 


A diagonal term of (28) gives 


(29) A= 0-3) tate pagr È Do - AP] eo. 


м Yn. gat 
Ѕіпсе 
(30) — (пт) =, (1 №), 
(31) elw} - (1- X -2(1- Xy 


under the assumption that W is normally distributed, the limiting distribution 
of Vn (А, — A) is N[0,(1 — A2? — 3A2)]. Note that this variance is smaller 
than in the case of X stochastic. 

From (28) we find 


(4-9) X [a- M)u wj, А + Ава (1 А) 7 Аи A] 


iWiaWja ^j 


R 
1 
- 


Then 


(33) (л) 6 (ht y э (1- №) (1-2) (№ - a} - №). 


The equation $H'S_{VV}H = I$ implies $H'H = I$, leading to $H^* = -H^{*\prime} + o_p(1)$,
that is, $h_{ij}^* = -h_{ji}^* + o_p(1)$.

Now suppose that the rank of $\mathbf{B}$ is $k < p_1 = p_2$. Then $\Lambda = \operatorname{diag}(\Lambda_1, 0)$,
where $\Lambda_1 = \operatorname{diag}(\lambda_1, \dots, \lambda_k)$. Let $\Gamma = (\Gamma_1, \Gamma_2)$, where $\Gamma_1$ has $k$ columns and
$\Gamma_2$ has $p_2 - k$ columns. Define the partition (5) to be made into $k$ and $p_1 - k$
rows and columns. The probability limit of (4) implies $H_{11} \xrightarrow{p} I$, $H_{12} \xrightarrow{p} 0$,
$H_{21} \xrightarrow{p} 0$, and $H_{22}'H_{22} \xrightarrow{p} I$. Let the singular value decomposition of $H_{22}$ be
$EJF$, where $J$ is a diagonal matrix of order $p_1 - k$ and $E$ and $F$ are
orthogonal matrices of order $p_1 - k$. Define $C_2 = EF$. The expansion of (4)
in terms of $S_{WV}^* = \sqrt n\,S_{WV}$, $S_{UV}^* = \sqrt n(S_{UV} - \Lambda)$, $S_{WW}^* = \sqrt n[S_{WW} -
(I - \Lambda^2)]$, $H_{11}^* = \sqrt n(H_{11} - I)$, $H_{12}^* = \sqrt n\,H_{12}$, $H_{21}^* = \sqrt n\,H_{21}$, and $H_{22}^* =
\sqrt n(H_{22} - C_2) = \sqrt n\,E(J - I)F$ yields


ж 


(34) А 5% (I— A1) + (1— А) УА, ~ A, SHA, A, SEC, 
52А, 0 


2А, А +H} A} — МН -ABS 
H3, Л? 0 


+0,(1). 


` The ith diagonal term of (34) is (29) for i=1,...,k. The i, jth element of the 
upper left-hand submatrix is (32) for {=} and i,j=1,...,k. Two other 
submatrix equations of (34) are 


(35) A, Hi, = Sy C, +0,(1), 
(36) НАЛ, = $$ +0,(1). 
The equation 7 = H'Sy,H = H'H yields 


Ay + НЕ Н + НС, 


7 
(37) CLA +H C, HŠ + HEC, 


=0+0,(1). 
The off-diagonal submatrices of (37) agree with (35) and (36). 


13.7.3. Reduced Rank Regression Estimator 


When the rank of $\mathbf{B}$ is specified to be $k$ ($< p_1$), the maximum likelihood
estimator of $\mathbf{B}$ is

(38)    $\hat{\mathbf{B}}_k = S_{12}\hat\Gamma_1\hat\Gamma_1'.$

See Section 12.7. In terms of (3) the reduced-rank regression estimator of $\Theta$
is

(39)    $\hat\Theta_k = S_{UV}H_1H_1'.$
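The estimator (38) can be sketched numerically as follows. This is an illustration under stated assumptions rather than the text's own computation: the function name is hypothetical, the data are taken to have mean zero as in model (2), $\hat\Gamma$ is obtained from (1) through a Cholesky factor of $S_{22}$, and $\hat\Gamma_1$ is taken to be the $k$ columns of $\hat\Gamma$ belonging to the $k$ largest roots.

```python
import numpy as np

def reduced_rank_estimator(X1, X2, k):
    """Reduced-rank estimator S12 G1 G1' of (38), where the columns of G1 are
    the first k solutions of S21 S11^{-1} S12 g = l^2 S22 g with g' S22 g = 1.
    X1 is p1 x n and X2 is p2 x n, columns being mean-zero observations."""
    n = X1.shape[1]
    S11 = X1 @ X1.T / n
    S12 = X1 @ X2.T / n
    S22 = X2 @ X2.T / n
    R = np.linalg.cholesky(S22)                    # S22 = R R'
    Rinv = np.linalg.inv(R)
    M = Rinv @ S12.T @ np.linalg.solve(S11, S12) @ Rinv.T
    M = (M + M.T) / 2.0                            # remove rounding asymmetry
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]                 # largest l^2 first
    Gamma_hat = Rinv.T @ vecs[:, order]            # Gamma_hat' S22 Gamma_hat = I
    G1 = Gamma_hat[:, :k]
    return S12 @ G1 @ G1.T

rng = np.random.default_rng(2)
X2 = rng.standard_normal((4, 200))                 # p2 = 4
B_true = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 4))   # rank 2
X1 = B_true @ X2 + rng.standard_normal((3, 200))   # p1 = 3
B_hat_2 = reduced_rank_estimator(X1, X2, 2)
```

Because $\hat\Gamma\hat\Gamma' = S_{22}^{-1}$, taking all $p_2$ columns reproduces the unrestricted least squares estimator $S_{12}S_{22}^{-1}$, while taking $k$ columns yields an estimate of rank $k$.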


Suppose $X^{(2)}$ is stochastic and $\Theta = \operatorname{diag}(\Theta_1, 0) = \operatorname{diag}(\Lambda_1, 0)$. We define
$\Theta_k^* = \sqrt n(\hat\Theta_k - \Theta)$, $H_1^* = \sqrt n[H_1 - (I_k, 0)']$, $S_{UV}^* = \sqrt n(S_{UV} - \Lambda)$, and $S_{VV}^*
= \sqrt n(S_{VV} - I)$. From $H_1'S_{VV}H_1 = I$ we find $H_{11}^* + H_{11}^{*\prime} = -S_{VV}^{*11} + o_p(1)$.

From (39) and (9) we obtain


A Si + A (H+H) АНЯ +0,(1) 
(40) OF = 2 P 
. $ 0 
ll x12 
_ Syv Swy + (1) 
Sy, 9 


We can compare 6, with the maximum likelihood estimator unrestricted by 
age A -1 
a rank condition Ө = Sj, $,,. Then 


(41) 6" = Vn (Ô - Ө) = (Sy, - 95) Sv 
sz x12 


wy M 


21 x22 
Syv Swy 


= 5%, + 0,(1) = | 


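since $S_{VV} \xrightarrow{p} I$. The effect of the rank restriction is to replace the lower
right-hand submatrix of $S_{WV}^*$ by 0 (the parameter value).

Since $S_{WV}^* = (1/\sqrt n)\sum_{\alpha=1}^{n} W_\alpha V_\alpha'$, we have $\operatorname{vec} S_{WV}^* = (1/\sqrt n)\sum_{\alpha=1}^{n}(V_\alpha \otimes W_\alpha)$.
Because $V_\alpha$ and $W_\alpha$ are independent,

(42)    $\mathcal{E}\operatorname{vec} S_{WV}^*(\operatorname{vec} S_{WV}^*)' = \mathcal{E}VV' \otimes \mathcal{E}WW' = I \otimes (I - \Lambda^2) = \operatorname{diag}(I - \Lambda^2, \dots, I - \Lambda^2),$

where $\Lambda = \operatorname{diag}(\Lambda_1, 0)$ and $I - \Lambda^2 = \operatorname{diag}(I - \Lambda_1^2, I)$. On the other hand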
^ 1 c р, (7) (2), (1) 
*=vec— > WV, y Oo, 
(43) vec Ө; = ve ^ i| А 
yo e W, 


а 


| +0,(1), 


1 n 
"A bie [^ 


where И. = (V, Oy and W, = (WP, W®'Y. Then 


(44) 
£ vec Ёў (vec ө) > 


= diag( I, — A3, I, - AS I, MO a = Ал. 0). 


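where there are $k$ blocks of $I - \Lambda^2$ and $p_2 - k$ blocks of $\operatorname{diag}(I_k - \Lambda_1^2, 0)$.

In the original coordinate system

(45)    $\operatorname{vec}(\hat{\mathbf{B}}_k - \mathbf{B}) = \operatorname{vec}[(A')^{-1}(\hat\Theta_k - \Theta)\Gamma'] = [\Gamma \otimes (A')^{-1}]\operatorname{vec}(\hat\Theta_k - \Theta)$
        $= \bigl[(\Gamma_1, \Gamma_2) \otimes \Sigma_{ZZ}\bigl(A_1(I - \Lambda_1^2)^{-1}, A_2\bigr)\bigr]\operatorname{vec}(\hat\Theta_k - \Theta).$

From (44) and (45) we obtain

(46)    $\mathcal{E}\operatorname{vec}\sqrt n(\hat{\mathbf{B}}_k - \mathbf{B})[\operatorname{vec}\sqrt n(\hat{\mathbf{B}}_k - \mathbf{B})]'$
        $\to [\Gamma_1\Gamma_1' \otimes \Sigma_{ZZ}] + [\Gamma_2\Gamma_2' \otimes \Sigma_{ZZ}A_1(I - \Lambda_1^2)^{-1}A_1'\Sigma_{ZZ}]$
        $= [\Gamma\Gamma' \otimes \Sigma_{ZZ}] - [\Gamma_2\Gamma_2' \otimes (\Sigma_{ZZ} - \Sigma_{ZZ}A_1(I - \Lambda_1^2)^{-1}A_1'\Sigma_{ZZ})]$
        $= \Sigma_{22}^{-1} \otimes \Sigma_{ZZ} - (\Gamma_2\Gamma_2' \otimes \Sigma_{ZZ}A_2A_2'\Sigma_{ZZ}).$

If we define $\Omega = \Sigma_{11}A_1\Lambda_1 = \Sigma_{ZZ}A_1\Lambda_1(I - \Lambda_1^2)^{-1}$ and $\Pi = \Gamma_1$, then $\mathbf{B} = \Omega\Pi'$.
We have

(47)    $\Omega(\Omega'\Sigma_{ZZ}^{-1}\Omega)^{-1}\Omega' = \Sigma_{ZZ} - \Sigma_{ZZ}A_2A_2'\Sigma_{ZZ},$

(48)    $\Pi(\Pi'\Sigma_{22}\Pi)^{-1}\Pi' = \Gamma_1\Gamma_1' = \Sigma_{22}^{-1} - \Gamma_2\Gamma_2'.$

Thus (46) can be written

(49)    $\mathcal{E}\operatorname{vec}\mathbf{B}_k^*(\operatorname{vec}\mathbf{B}_k^*)' \to \Sigma_{22}^{-1} \otimes \Sigma_{ZZ} - [\Sigma_{22}^{-1} - \Pi(\Pi'\Sigma_{22}\Pi)^{-1}\Pi'] \otimes [\Sigma_{ZZ} - \Omega(\Omega'\Sigma_{ZZ}^{-1}\Omega)^{-1}\Omega'].$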

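Theorem 13.7.1. Let $(X_\alpha^{(1)\prime}, X_\alpha^{(2)\prime})'$, $\alpha = 1, \dots, n$, be observations on the
random vector $X$ with mean 0 and covariance matrix $\Sigma$. Let $\mathbf{B} = \Sigma_{12}\Sigma_{22}^{-1}$. Let
the columns of $\hat\Gamma_1$ satisfy (1) and $\hat\gamma_{1i} > 0$. Suppose that $X^{(1)} - \mathbf{B}X^{(2)} = Z$ is
independent of $X^{(2)}$. Then the limiting distribution of $\operatorname{vec}\mathbf{B}_k^* = \sqrt n\operatorname{vec}(\hat{\mathbf{B}}_k - \mathbf{B})$,
with $\hat{\mathbf{B}}_k = S_{12}\hat\Gamma_1\hat\Gamma_1'$, is normal with mean 0 and covariance matrix (46) or (49).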


Note that $\mathbf{B} = \Omega\Pi' = \Omega M'(\Pi M^{-1})'$ for arbitrary nonsingular $M$; how-
ever, (47) and (48) are invariant with respect to the transformation $\Omega \to \Omega M'$
and $\Pi \to \Pi M^{-1}$. Thus (49) holds for any factorization $\mathbf{B} = \Omega\Pi'$.

The limiting distribution of $\mathbf{B}_k^*$ only depends on $\sqrt n\,S_{Z2}S_{22}^{-1} =
(A')^{-1}S_{WV}^*S_{VV}^{-1}\Gamma'$ and hence holds under the same conditions as the asymp-
totic normality of the least squares estimator $\hat{\mathbf{B}}$.

Now suppose that $X_\alpha^{(2)} = x_\alpha^{(2)}$, $\alpha = 1, \dots, n$, is nonstochastic and that (19)
holds. The model is (2); in the transformed coordinates [$U_\alpha = A'X_\alpha^{(1)}$, $v_\alpha =
\Gamma'x_\alpha^{(2)}$, $W_\alpha = A'Z_\alpha$, $\Theta = A'\mathbf{B}(\Gamma')^{-1} = \Lambda$] the model is (3). $H_1 = \Gamma^{-1}\hat\Gamma_1$ satis-
fies (34) and (37). Again (39) holds. Further, (42) and (43) hold with $V_\alpha = v_\alpha$
nonstochastic.


Corollary 13.7.1. Let $x_1^{(2)}, \dots, x_n^{(2)}$ be a set of vectors such that (19) holds.
Let $X_\alpha^{(1)} = \mathbf{B}x_\alpha^{(2)} + Z_\alpha$, $\alpha = 1, \dots, n$, where $Z_\alpha$ is an observation on a random
vector $Z$ with $\mathcal{E}Z = 0$ and $\mathcal{E}ZZ' = \Sigma_{ZZ}$. Suppose $\mathbf{B}$ has rank $k$. Then the
limiting distribution of $\sqrt n\operatorname{vec}(\hat{\mathbf{B}}_k - \mathbf{B})$ is normal with mean 0 and covariance
(46) or (49).

13.8.1. Observations Elliptically Contoured


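Let $x_1, \dots, x_N$ be $N$ observations on a random vector $X$ with density

(1)    $|\Psi|^{-1/2}\, g[(x - \nu)'\Psi^{-1}(x - \nu)],$

where $\Psi$ is a positive definite matrix, $R^2 = (x - \nu)'\Psi^{-1}(x - \nu)$, and
$\mathcal{E}R^4 < \infty$. Define $\kappa = p\,\mathcal{E}R^4/[(\mathcal{E}R^2)^2(p + 2)] - 1$. Then $\mathcal{E}X = \nu = \mu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Psi = \Sigma$. Define $\bar x$ and $S$ as the sample mean
and covariance matrix. Define the orthogonal matrices $\beta$ and $B$ and the
diagonal matrices $\Lambda$ and $L$ by

(2)    $\Sigma = \beta\Lambda\beta', \qquad S = BLB',$

$\lambda_1 > \cdots > \lambda_p$, $l_1 > \cdots > l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \dots, p$. As in Section 13.5.1,
define $T = \beta'S\beta = YLY'$, where $Y = \beta'B$ is orthogonal and $y_{ii} \ge 0$. Then
$\mathcal{E}T = \beta'\Sigma\beta = \Lambda$.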

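The limiting covariances of $\sqrt N\operatorname{vec}(S - \Sigma)$ and $\sqrt N\operatorname{vec}(T - \Lambda)$ are

(3)    $\lim_{N \to \infty} N\,\mathcal{E}\operatorname{vec}(S - \Sigma)[\operatorname{vec}(S - \Sigma)]' = (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\operatorname{vec}\Sigma(\operatorname{vec}\Sigma)',$

(4)    $\lim_{N \to \infty} N\,\mathcal{E}\operatorname{vec}(T - \Lambda)[\operatorname{vec}(T - \Lambda)]' = (\kappa + 1)(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \kappa\operatorname{vec}\Lambda(\operatorname{vec}\Lambda)'.$

In terms of components $\mathcal{E}t_{ij} = \lambda_i\delta_{ij}$ and

(5)    $\lim_{N \to \infty} N\,\mathcal{E}(t_{ij} - \lambda_i\delta_{ij})(t_{kl} - \lambda_k\delta_{kl}) = (\kappa + 1)(\lambda_i\lambda_j\delta_{ik}\delta_{jl} + \lambda_i\lambda_j\delta_{il}\delta_{jk}) + \kappa\lambda_i\lambda_k\delta_{ij}\delta_{kl}.$

Let $\sqrt N(T - \Lambda) = U$, $\sqrt N(L - \Lambda) = D$, and $\sqrt N(Y - I) = W$. The set
$u_{11}, \dots, u_{pp}$ are asymptotically independent of the set $(u_{12}, \dots, u_{p-1,p})$; the
off-diagonal elements $u_{ij}$, $i \ne j$, are mutually independent with variances $(\kappa + 1)\lambda_i\lambda_j$;
the variance of $u_{ii} = d_i$ converges to $(3\kappa + 2)\lambda_i^2$; the covariance of $u_{ii} = d_i$
and $u_{kk} = d_k$, $i \ne k$, converges to $\kappa\lambda_i\lambda_k$. The limiting distribution of $w_{ij}$, $i \ne j$,
is the limiting distribution of $u_{ij}/(\lambda_j - \lambda_i)$. Thus the $w_{ij}$, $i < j$, are asymptoti-
cally mutually independent with $\mathcal{E}w_{ij}^2 = (\kappa + 1)\lambda_i\lambda_j/(\lambda_i - \lambda_j)^2$. These vari-
ances and covariances for the normal case hold for $\kappa = 0$.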
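The kurtosis parameter $\kappa$ and the resulting asymptotic variance $(3\kappa + 2)\lambda_i^2$ can be evaluated directly from the moments of $R^2$. The sketch below is illustrative; the function names are hypothetical, and the moments used for the multivariate $t$ case ($R^2/p$ distributed as $F_{p,\nu}$) are a standard fact assumed here, not taken from this section.

```python
def kurtosis_parameter(p, ER2, ER4):
    """kappa = p E(R^4) / [(E R^2)^2 (p + 2)] - 1, as defined below (1)."""
    return p * ER4 / (ER2 ** 2 * (p + 2)) - 1.0

def asymptotic_root_variance(lam, kappa):
    """Limiting variance (3 kappa + 2) lambda_i^2 of d_i = sqrt(N)(l_i - lambda_i)."""
    return (3.0 * kappa + 2.0) * lam ** 2

# Normal case: R^2 is chi-square with p degrees of freedom,
# so E R^2 = p and E R^4 = p(p + 2), which gives kappa = 0.
kappa_normal = kurtosis_parameter(3, 3.0, 15.0)

# Multivariate t with nu degrees of freedom (assumed moments of p * F_{p, nu}).
p, nu = 3, 10.0
ER2 = p * nu / (nu - 2)
ER4 = p * (p + 2) * nu ** 2 / ((nu - 2) * (nu - 4))
kappa_t = kurtosis_parameter(p, ER2, ER4)        # equals 2 / (nu - 4)
```

With $\kappa = 0$ the variance reduces to $2\lambda_i^2$, the normal-theory value of Theorem 13.5.1; heavier tails ($\kappa > 0$) inflate it.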
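Theorem 13.8.1. Define diagonal $\Lambda$ and $L$ and orthogonal $\beta$ and $B$ by (2),
$\lambda_1 > \cdots > \lambda_p$, $l_1 > \cdots > l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \dots, p$. Define $G = \sqrt N(B -
\beta)$ and diagonal $D = \sqrt N(L - \Lambda)$. Then the limiting distribution of $G$ and $D$ is
normal with $G$ and $D$ independent. The variance of $d_i$ is $(2 + 3\kappa)\lambda_i^2$, and the
covariance of $d_i$ and $d_j$ is $\kappa\lambda_i\lambda_j$. The covariance of $g_i$ is

(6)    $\mathcal{AC}(g_i) = (1 + \kappa)\sum_{k=1,\, k \ne i}^{p} \dfrac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta_k\beta_k'.$

The covariance matrix of $g_i$ and $g_j$ is

(7)    $\mathcal{AC}(g_i, g_j) = -(1 + \kappa)\dfrac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta_j\beta_i', \qquad i \ne j.$

Proof. The proof is the same as for Theorem 13.5.1 except that (4) is used
instead of (4) with $\kappa = 0$. ∎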


In Section 11.7.3 we used the asymptotic distribution of the smallest $q$
sample roots when the smallest $q$ population roots are equal. Let $\Lambda =
\operatorname{diag}(\Lambda_1, \lambda^* I_q)$, where the diagonal elements of (diagonal) $\Lambda_1$ are different
and are larger than $\lambda^*$. As before, let $U = \sqrt N(T - \Lambda)$, and let $U_{22}$ be the
lower right-hand $q \times q$ submatrix of $U$. Let $D_2$ and $Y_{22}$ be the lower
right-hand $q \times q$ submatrices of $D$ and $Y$. It was shown in Section 13.5.2 that
$U_{22} = Y_{22}D_2Y_{22}' + o_p(1)$.

The criterion for testing the null hypothesis $\lambda_{p-q+1} = \cdots = \lambda_p$ is

(8)    $\dfrac{\prod_{i=p-q+1}^{p} l_i}{\bigl(\sum_{i=p-q+1}^{p} l_i/q\bigr)^q}.$

In Section 11.7.3 it was shown that $-N$ times the logarithm of (8) has the
limiting distribution of

(9)    $\dfrac{1}{2\lambda^{*2}}\Bigl[\operatorname{tr}U_{22}^2 - \dfrac{1}{q}(\operatorname{tr}U_{22})^2\Bigr].$

The term $\sum_{i<j}u_{ij}^2/\lambda^{*2}$ has the limiting distribution of $(1 + \kappa)\chi^2_{q(q-1)/2}$. The
limiting distribution of $(u_{p-q+1,p-q+1}, \dots, u_{pp})$ is normal with mean 0 and
covariance matrix $\lambda^{*2}[2(1 + \kappa)I_q + \kappa\varepsilon\varepsilon']$. The limiting distribution of
$[\sum_i u_{ii}^2 - (\sum_i u_{ii})^2/q]/\lambda^{*2}$ is $2(1 + \kappa)\chi^2_{q-1}$. Hence, the limiting distribution of
(9) is the distribution of $(1 + \kappa)\chi^2_{q(q+1)/2 - 1}$.

We are also interested in the characteristic roots and vectors of one
covariance matrix in the metric of another covariance matrix.
i ixofas le of size 
le covariance matrix of a sample o, 

13.82. Let S* be ше sampie co EN 

from (D. and let Т* be the sample covariance matrix of a sample of size А 
Mara with W replaced by È. Let А be the diagonal matrix A а > ub ? 
> 0 ^ А, ar? the roots 0 - = 0. 
diagonal elements, where à, . ... Ар i r- 20 

te Pere ) be the matrix with y; the solution of er AEN ; 
Syn e FH 0. Let X* = (x,..., x5) and diagonal L* consist of the 

ү'Хү = 1, апа y, z 9. 


solutions to 
(10) (s* —IT*)x* = 0, 
;mitine distribu- 
M/N > т, the limiting distri 
Т = 1, nd x* > 0. Аз М = оо, М > co, А | MM 
Е p - VN CX — Г) and diagonal D* = J/N(L — A) is normal with th 
ion - | 


following covanances. 


1+7 


(п) «Y(d)-Qt39X3— 


1+7 
{12) AG (d;,d;) = Kr Ay 
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P АТА, , 243k , 
(13) We(z;) = (1 «) У Os ио + — g YY 
i=l (А-А) 
. 2+3к 
(HM) — e(di.z) = AY 
AAC +n) , x I 
(15) ме) = (к) ту + фур іж}, 
п(А=А) 


(16) (4.2) = FANY 


Proof. Transform S* and T* to S = Γ'S*Γ and T = Γ'T*Γ, Ψ and Σ
to Λ = Γ'ΨΓ and I = Γ'ΣΓ, and X* to X = Γ⁻¹X* = G. Let D =
√N(L − Λ), H = √N(X − I), U = √N(S − Λ), and V = √N(T − I). The
matrices U and V and D and H have limiting normal distributions; they are
related by (20), (21), and (22) of Section 13.6. From there and the covariances
of the limiting distributions we derive (11) to (16). ∎
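The roots and vectors defined by (10) can be computed numerically. The following is a minimal sketch, not the book's procedure: it assumes NumPy, uses normal samples as a stand-in for elliptically contoured ones, and the function name `roots_in_metric` and all numerical values are hypothetical. It reduces (10) to an ordinary symmetric eigenproblem through a Cholesky factor of T*.

```python
import numpy as np

def roots_in_metric(S, T):
    """Solve (S - l T) x = 0 with x'Tx = 1 and first coordinate x_1 >= 0."""
    K = np.linalg.cholesky(T)              # T = K K' (T must be positive definite)
    K_inv = np.linalg.inv(K)
    M = K_inv @ S @ K_inv.T                # symmetric matrix with the same roots
    l, Q = np.linalg.eigh(M)               # eigh returns roots in ascending order
    l, Q = l[::-1], Q[:, ::-1]             # reorder so that l_1 > ... > l_p
    X = K_inv.T @ Q                        # columns now satisfy x'Tx = 1
    signs = np.where(X[0, :] < 0, -1.0, 1.0)
    return l, X * signs                    # flip columns so that x_1i >= 0

# Hypothetical data: two independent samples of sizes M and N.
rng = np.random.default_rng(0)
p, M_size, N_size = 3, 200, 150
sample_1 = rng.standard_normal((M_size, p)) * np.array([3.0, 2.0, 1.0])
sample_2 = rng.standard_normal((N_size, p))
S_star = np.cov(sample_1, rowvar=False)
T_star = np.cov(sample_2, rowvar=False)
l, X = roots_in_metric(S_star, T_star)
```

The Cholesky route is chosen only because it keeps the sketch within plain NumPy; a generalized symmetric eigensolver would serve equally well.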


13.8.2. Elliptically Contoured Matrix Distributions


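Let Y (p × N) have the density g(tr YY'). Then A = YY' has the density
(Lemma 13.3.1)

(17)  $\dfrac{\pi^{pN/2}\,|A|^{(N-p-1)/2}}{\Gamma_p(\tfrac12 N)}\; g(\operatorname{tr} A).$

Let A = BLB', where L is diagonal with diagonal elements l_1 > ··· > l_p and
B is orthogonal with b_1i ≥ 0. Since g(tr A) = g(Σ_{i=1}^p l_i), the density
of l_1, ..., l_p is (Theorem 13.3.4)

(18)  $\dfrac{\pi^{p^2/2}}{\Gamma_p(\tfrac12 p)}\cdot\dfrac{\pi^{pN/2}}{\Gamma_p(\tfrac12 N)}\,\prod_{i=1}^{p} l_i^{(N-p-1)/2}\; g\Big(\sum_{i=1}^{p} l_i\Big)\prod_{i<j}(l_i - l_j),$

and the matrix B is independently distributed according to the conditional
Haar invariant distribution.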
Suppose Y* (p × m) and Z* (p × n) have the density

$|\Psi|^{-(m+n)/2}\, g[\operatorname{tr}(Y^{*\prime}\Psi^{-1}Y^* + Z^{*\prime}\Psi^{-1}Z^*)] \qquad (m, n > p).$

Let C be a matrix such that CΨC' = I. Then Y = CY* and Z = CZ* have
the density g[tr(YY' + ZZ')]. Let A* = Y*Y*', B* = Z*Z*', A = YY', and
B = ZZ'. The roots of |A* − lB*| = 0 are the roots of |A − lB| = 0. Let the
roots of |A − f(A + B)| = 0 be f_1 > ··· > f_p, and let F = diag(f_1, ..., f_p).
Define E (p × p) by A + B = E'E, and A = E'FE, and e_i1 ≥ 0, i = 1, ..., p.


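Theorem 13.8.3. The matrices E and F are independent. The density of F is

(19)  $\dfrac{\pi^{p^2/2}\,\Gamma_p[\tfrac12(m+n)]}{\Gamma_p(\tfrac12 m)\,\Gamma_p(\tfrac12 n)\,\Gamma_p(\tfrac12 p)}\prod_{i=1}^{p} f_i^{(m-p-1)/2}\prod_{i=1}^{p}(1-f_i)^{(n-p-1)/2}\prod_{i<j}(f_i - f_j);$

the density of E is

(20)  $\dfrac{2^p\,\Gamma_p(\tfrac12 p)\,\pi^{p(n+m-p)/2}}{\Gamma_p[\tfrac12(m+n)]}\,|E'E|^{(m+n-p)/2}\, g(\operatorname{tr} E'E).$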

In the development in Section 13.2 the observations Y, Z have the density

(21)  $(2\pi)^{-p(m+n)/2}\, e^{-\frac12 \operatorname{tr}(YY' + ZZ')} = (2\pi)^{-p(n+m)/2}\, e^{-\frac12 \operatorname{tr}(A+B)},$

and in Section 13.7 g[tr(YY' + ZZ')] = g[tr(A + B)]. The distribution of the
roots does not depend on the form of g(·); the distribution of E depends
only on E'E = A + B. The algebra in Section 13.2 carries over to this more
general case.
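As an illustration of the definitions of F and E, the following minimal sketch (assuming NumPy; the function name `roots_and_E` and the numerical inputs are hypothetical) computes the roots of |A − f(A + B)| = 0 and a matrix E with A + B = E'E, A = E'FE, and e_i1 ≥ 0. It also checks numerically that the roots are unchanged when A and B are replaced by CAC' and CBC' for a nonsingular C.

```python
import numpy as np

def roots_and_E(A, B):
    """Return f_1 > ... > f_p and E with A + B = E'E, A = E' diag(f) E, e_i1 >= 0."""
    G = A + B
    K = np.linalg.cholesky(G)                  # G = K K'
    K_inv = np.linalg.inv(K)
    f, Q = np.linalg.eigh(K_inv @ A @ K_inv.T) # roots of |A - f G| = 0, ascending
    f, Q = f[::-1], Q[:, ::-1]                 # largest root first
    E = Q.T @ K.T                              # E'E = K Q Q' K' = G
    signs = np.where(E[:, 0] < 0, -1.0, 1.0)
    return f, E * signs[:, None]               # flip rows so the first column is >= 0

rng = np.random.default_rng(3)
p, m, n = 3, 6, 7
Y = rng.standard_normal((p, m))
Z = rng.standard_normal((p, n))
A, B = Y @ Y.T, Z @ Z.T
f, E = roots_and_E(A, B)

# A fixed nonsingular matrix (determinant 7), used to check invariance of the roots.
C = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
f_transformed, _ = roots_and_E(C @ A @ C.T, C @ B @ C.T)
```

Flipping the sign of a row of E leaves both E'E and E'FE unchanged because F is diagonal, which is why the sign convention can be imposed last.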


PROBLEMS 


13.1. (Sec. 13.2) Prove Theorem 13.2.1 for p=2 by calculating the Jacobian 
directly. 


13.2. (Sec. 13.2) Prove Theorem 13.3.2 for p = 2 directly by representing the
orthogonal matrix C in terms of the cosine and sine of an angle.


13.3. (Sec. 13.2) Consider the distribution of the roots of |A − lB| = 0 when A and
B are of order two and are distributed according to W(Σ, m) and W(Σ, n),
respectively.


(a) Find the distribution of the larger root. 
(b) Find the distribution of the smaller root. 
(c) Find the distribution of the sum of the roots. 


13.4. (Sec. 13.2) Prove that the Jacobian |∂(G, A)/∂(E, F)| is ∏(f_i − f_j) times a
function of E by showing that the Jacobian vanishes for f_i = f_j and that its
degree in f_i is the same as that of ∏(f_i − f_j).


13.5. (Sec. 13.3) Give the Haar invariant distribution explicitly for the 2 × 2 orthog-
onal matrix represented in terms of the cosine and sine of an angle.


13.6. (Sec. 13.3) Let A and B be distributed according to W(Σ, m) and W(Σ, n),
respectively. Let l_1 > ··· > l_p be the roots of |A − lB| = 0 and m_1 > ··· > m_p
be the roots of |A − mΣ| = 0. Find the distribution of the m's from that of the
l's by letting n → ∞.


13.7. (Sec. 13.3) Prove Lemma 13.3.1 in as much detail as Theorem 13.3.1.


13.8. Let A be distributed according to W(Σ, n). In case of p = 2 find the distribu-
tion of the characteristic roots of A. [Hint: Transform so that Σ goes into a
diagonal matrix.]


13.9. From the result in Problem 13.6 find the distribution of the sphericity criterion 
(when the null hypothesis is not true). 


13.10. (Sec. 13.3) Show that X (p × n) has the density f_X(X'X) if and only if T has
the density

$\dfrac{2^p\,\pi^{pn/2}}{\Gamma_p(\tfrac12 n)}\prod_{i=1}^{p} t_{ii}^{\,n-i}\, f_X(TT'),$

where T is the lower triangular matrix with positive diagonal elements such
that TT' = X'X. [Srivastava and Khatri (1979).] [Hint: Compare Lemma 13.3.1
with Corollary 7.2.1.]


13.11. (Sec. 13.5.2) In the case that the covariance matrix is (12) find the limiting 
distribution of Di, Wip W,;, and W;,. 


13.12. (Sec. 13.3) Prove (6) of Section 12.4. 


CHAPTER 14 


Factor Analysis 


14.1. INTRODUCTION 


Factor analysis is based on a model in which the observed vector is parti-
tioned into an unobserved systematic part and an unobserved error part. The
components of the error vector are considered as uncorrelated or indepen- 
dent, while the systematic part is taken as a linear combination of a relatively 
small number of unobserved factor variables. The analysis separates the 
effects of the factors, which are of basic interest, from the errors. From 
another point of view the analysis gives a description or explanation of the 
interdependence of a set of variables in terms of the factors without regard to 
the observed variability. This approach is to be compared with principal 
component analysis, which describes or "explains" the variability observed. 
Factor analysis was developed originally for the analysis of scores on mental 
tests; however, the methods are useful in a much wider range of situations, 
for example, analyzing sets of tests of attitudes, sets of physical measure- 
ments, and sets of economic quantities. When a battery of tests is given to a 
group of individuals, it is observed that the score of an individual on a given 
test is more related to his scores on other tests than to the scores of other 
individuals on the other tests; that is, usually the scores for any particular 
individual are interrelated to some degree. This interrelation is “explained” 
by considering a test score of an individual as made up of a part which is
peculiar to this particular test (called error) and a part which is a function of 
more fundamental quantities called scores of primary abilities or factor scores. 
Since they enter several test scores, it is their effect that connects the various
test scores. Roughly, the idea is that a person who is more intelligent in some
respects will do better on many tests than someone who is less intelligent. 

The model for factor analysis is defined and discussed in Section 14.2. 
Maximum likelihood estimators of the parameters are derived in the case 
that the factor scores and errors are normally distributed, and a test that the 
model fits is developed. The large-sample distribution theory is given for the 
estimators and test criterion (Section 14.3). Maximum likelihood estimators 
for fixed factors do not exist, but alternative estimation procedures are 
suggested (Section 14.4). Some aspects of interpretation are treated in 
Section 14.5. The maximum likelihood estimators are derived when the 
factors are normal and identification is effected by specified zero loadings. 
Finally the estimation of factor scores is considered. Anderson (1984a)
discusses the relationship of factor analysis to principal components and 
linear functional and structural relationships. 


14.2. THE MODEL 


14.2.1. Definition of the Model 


Let the observable vector X be written as

(1)  X = Λf + U + μ,

where X, U, and μ are column vectors of p components, f is a column
vector of m (≤ p) components, and Λ is a p × m matrix. We assume that U
is distributed independently of f and with mean ℰU = 0 and covariance
matrix ℰUU' = Ψ, which is diagonal. The vector f will be treated alterna-
tively as a random vector and as a vector of parameters that varies from
observation to observation.

In terms of mental tests each component of X is a score on a test or
battery of tests. The corresponding component of μ is the average score of
this test in the population. The components of f are the scores of the mental
factors; linear combinations of these enter into the test scores. The
coefficients of these linear combinations are the elements of Λ, and these are
called factor loadings. Sometimes the elements of f are called common
factors because they are common to several different tests; in the first
presentation of this kind of model [Spearman (1904)] f consisted of one
component and was termed the general factor. A component of U is the part
of the test score not "explained" by the common factors. This is considered as
made up of the error of measurement in the test plus a specific factor, having
to do only with this particular test. Since in our model (with one set of
observations on each individual) we cannot distinguish between these two
components of the coordinate of U, we shall simply term the element of U
the error of measurement.

The specification of a given component of X is similar to that in regres-
sion theory (or analysis of variance) in that it is a linear combination of other
variables. Here, however, f, which plays the role of the independent variable, 
is not observed. 

We can distinguish between two kinds of models. In one we consider the 
vector f to be a random vector, and in the other we consider f to be a vector 
of nonrandom quantities that varies from one individual to another. In the 
second case, it is more accurate to write X_α = Λf_α + U_α + μ. The nonrandom
factor score vector may seem a better description of the systematic part, but 
it poses problems of inference because the likelihood function may not have 
a maximum. In principle, the model with random factors is appropriate when 
different samples consist of different individuals; the nonrandom factor 
model is suitable when the specific individuals involved and not just the 
structure are of interest. 

When f is taken as random, we assume ℰf = 0. (Otherwise, ℰX =
Λℰf + μ, and μ can be redefined to absorb Λℰf.) Let ℰff' = Φ. Our
analysis will be made in terms of first and second moments. Usually, we shall
consider f and U to have normal distributions. If f is not random, then
f = f_α for the αth individual. Then we shall assume usually
(1/N) Σ_{α=1}^N f_α = 0 and (1/N) Σ_{α=1}^N f_α f_α' = Φ.

There is a fundamental indeterminacy in this model. Let f = Cf* (f* =
C⁻¹f) and Λ* = ΛC, where C is a nonsingular m × m matrix. Then (1) can
be written as

(2)  X = Λ*f* + U + μ.

When f is random, ℰf*f*' = C⁻¹Φ(C⁻¹)' = Φ*; when f is nonrandom,
(1/N) Σ_{α=1}^N f*_α f*_α' = Φ*. The model with Λ and f is equivalent to the model
with Λ* and f*; that is, by observing X we cannot distinguish between these
two models.
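The indeterminacy can be checked numerically. The sketch below assumes NumPy; the loadings, the matrix C, and the factor covariance are arbitrary hypothetical values chosen only for illustration. It confirms that Λf = Λ*f* and that ΛΦΛ' = Λ*Φ*Λ*', so neither the systematic part nor its contribution to the covariance of X distinguishes the two parametrizations.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2
Lam = rng.standard_normal((p, m))            # loadings (hypothetical values)
Phi = np.array([[1.0, 0.3], [0.3, 2.0]])     # factor covariance matrix
C = np.array([[2.0, 1.0], [0.5, 1.5]])       # any nonsingular m x m matrix
f = rng.standard_normal(m)                   # one vector of factor scores

C_inv = np.linalg.inv(C)
Lam_star = Lam @ C                           # transformed loadings
f_star = C_inv @ f                           # transformed factor scores
Phi_star = C_inv @ Phi @ C_inv.T             # covariance of the transformed factors

same_systematic_part = np.allclose(Lam @ f, Lam_star @ f_star)
same_covariance_part = np.allclose(Lam @ Phi @ Lam.T,
                                   Lam_star @ Phi_star @ Lam_star.T)
```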

Some of the indeterminacy in the model can be eliminated by requiring
that ℰff' = I if f is random, or Σ_{α=1}^N f_α f_α' = NI if f is not random. In this
case the factors are said to be orthogonal; if Φ is not diagonal, the factors
are said to be oblique. When we assume Φ = I, then ℰf*f*' = C⁻¹(C⁻¹)' = I
(i.e., I = CC'). The indeterminacy is equivalent to multiplication by an orthogonal
matrix; this is called the problem of rotation. Requiring that Φ be diagonal
means that the components of f are independently distributed when f is
assumed normal. This has an appeal to psychologists because one idea of
common mental factors is (by definition) that they are independent or
uncorrelated quantities.

A crucial assumption is that the components of U are uncorrelated. Our
viewpoint is that the errors of observation and the specific factors are by 
definition uncorrelated. That is, the interrelationships of the test scores are 
caused by the common factors, and that is what we want to investigate. There 
is another point of view on factor analysis that is fundamentally quite 
different; that is, that the common factors are supposed to explain or account 
for as much of the variance of the test scores as possible. To follow this point 
of view, we should use a different model. 

A geometric picture helps the intuition. Consider a p-dimensional space.
The columns of Λ can be considered as m vectors in this space. They span
some m-dimensional subspace; in fact, they can be considered as coordinate
axes in the m-dimensional space, and f can be considered as coordinates of
a point in that space referred to this particular axis system. This subspace is
called the factor space. Multiplying Λ on the right by a matrix corresponds to
taking a new set of coordinate axes in the factor space.

If the factors are random, the covariance matrix of the observed X is

(3)  Σ = ℰ(X − μ)(X − μ)' = ℰ(Λf + U)(Λf + U)' = ΛΦΛ' + Ψ.

If the factors are orthogonal (ℰff' = I), then (3) is

(4)  Σ = ΛΛ' + Ψ.

If f and U are normal, all the information about the structure comes from
(3) [or (4)] and ℰX = μ.
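For reference, the step from the second to the third expression in (3) can be written out as follows. This is only an expansion of the product under the stated assumptions that f and U are independent with ℰf = 0 and ℰU = 0, so that the cross terms vanish.

```latex
\begin{aligned}
\Sigma &= \mathcal{E}(\Lambda f + U)(\Lambda f + U)' \\
       &= \Lambda\,\mathcal{E}(ff')\,\Lambda'
          + \Lambda\,\mathcal{E}(fU')
          + \mathcal{E}(Uf')\,\Lambda'
          + \mathcal{E}(UU') \\
       &= \Lambda\Phi\Lambda' + 0 + 0 + \Psi ,
\qquad\text{since } \mathcal{E}(fU') = (\mathcal{E}f)(\mathcal{E}U)' = 0 .
\end{aligned}
```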


14.2.2. Identification 


Given a covariance matrix Σ and a number m of factors, we can ask whether
there exist a triplet Λ, Φ positive definite, and Ψ positive definite and
diagonal to satisfy (3); if so, is the triplet unique? Since any triplet can be
transformed into an equivalent structure ΛC, C⁻¹ΦC'⁻¹, and Ψ, we can
put m² independent conditions on Λ and Φ to rule out this indeterminacy.
The number of components in the observable Σ and the number of condi-
tions (for uniqueness) is ½p(p + 1) + m²; the numbers of parameters in Λ,
Φ, and Ψ are pm, ½m(m + 1), and p, respectively. If the excess of observed
quantities and conditions over number of parameters, namely, ½[(p − m)²
− p − m], is positive, we can expect a problem of existence but can anticipate
uniqueness if a set of parameters does exist. If the excess is negative, we can
expect existence but possibly not uniqueness; if the excess is 0, we can hope
for both existence and uniqueness (or at least a finite number of solutions).
The question of existence of a solution is whether there exists a diagonal
matrix Ψ with nonnegative diagonal entries such that Σ − Ψ is positive
semidefinite of rank m. Anderson and Rubin (1956) include most of the
known results on this problem.
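The counting argument is easy to tabulate. The following small sketch (the function name `identification_excess` is hypothetical) computes the excess of observed quantities and conditions over parameters directly from the counts given above, and it agrees with the closed form ½[(p − m)² − p − m].

```python
def identification_excess(p, m):
    """Excess of (elements of Sigma + m^2 conditions) over free parameters."""
    observed_and_conditions = p * (p + 1) // 2 + m * m
    parameters = p * m + m * (m + 1) // 2 + p   # Lambda, Phi, and diagonal Psi
    return observed_and_conditions - parameters

def closed_form(p, m):
    # (p - m)^2 - p - m is always even, so integer division is exact here.
    return ((p - m) ** 2 - p - m) // 2

excess_5_2 = identification_excess(5, 2)   # positive: existence is the issue
excess_6_3 = identification_excess(6, 3)   # zero: borderline case
excess_4_2 = identification_excess(4, 2)   # negative: uniqueness is the issue
```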

If a solution exists and is unique, the model is said to be identified. As
noted above, some m² conditions have to be put on Λ and Φ to eliminate a
transformation Λ* = ΛC and Φ* = C⁻¹ΦC'⁻¹. We have referred above to
the condition Φ = I, which forces a transformation C to be orthogonal.
[There are ½m(m + 1) component equations in Φ = I.] For some purposes, it
is convenient to add the restrictions that

(5)  Γ = Λ'Ψ⁻¹Λ

is diagonal. If the diagonal elements of Γ are ordered and different (γ_11 >
γ_22 > ··· > γ_mm), Λ is uniquely determined. Alternative conditions are that
the first m rows of Λ form a lower triangular matrix. A generalization of this
condition is to require that the first m rows of BΛ form a lower triangular
matrix, where B is given in advance. (This condition is implied by the
so-called centroid method.)
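Condition (5) can always be imposed by an orthogonal rotation when Φ = I. The sketch below is an illustration under that assumption; it uses NumPy, and the function name `rotate_to_diagonal_gamma` and the numerical values are hypothetical. It diagonalizes Λ'Ψ⁻¹Λ with an orthogonal C and replaces Λ by ΛC, which leaves ΛΛ' unchanged.

```python
import numpy as np

def rotate_to_diagonal_gamma(Lam, psi):
    """Rotate loadings so that Lam' Psi^{-1} Lam is diagonal with ordered elements."""
    G = Lam.T @ (Lam / psi[:, None])          # Lam' Psi^{-1} Lam for diagonal Psi
    gamma, C = np.linalg.eigh(G)              # G = C diag(gamma) C', C orthogonal
    gamma, C = gamma[::-1], C[:, ::-1]        # order gamma_11 > gamma_22 > ...
    return Lam @ C, gamma

rng = np.random.default_rng(4)
Lam = rng.standard_normal((6, 2))             # arbitrary loadings
psi = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 2.0])  # diagonal of Psi
Lam_rot, gamma = rotate_to_diagonal_gamma(Lam, psi)
Gamma_rot = Lam_rot.T @ (Lam_rot / psi[:, None])
```

The rotated loadings remain determined only up to the sign of each column, which a further convention (for example, a positive column sum) would fix.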


Simple Structure 
These are conditions proposed by Thurstone (1947, p. 335) for choosing a
matrix out of the class ΛC that will have particular psychological meaning. If
λ_iα = 0, then the αth factor does not enter into the ith test. The general idea
of simple structure is that many tests should not depend on all the factors
when the factors have real psychological meaning. This suggests that, given a
Λ, one should consider all rotations, that is, all matrices ΛC where C is
orthogonal, and choose the one giving most 0 coefficients. This matrix can be
considered as giving the simplest structure and presumably the one with most
meaningful psychological interpretation. It should be remembered that the
psychologist can construct his or her tests so that they depend on the
assumed factors in different ways.

The positions of the 0's are not chosen in advance, but rotations C are
tried until a Λ is found satisfying these conditions. It is not clear that these
conditions effect identification. Reiersøl (1950) modified Thurstone's condi-
tions so that there is only one rotation that satisfies the conditions, thus
effecting identification.


Zero Elements in Specified Positions 

Here we consider a set of conditions that requires of the investigator more
a priori information. He or she must know that some particular tests do not
depend on some specific factors. In this case, the conditions are that λ_iα = 0
for specified pairs (i, α); that is, that the αth factor does not affect the ith
test score. Then we do not assume that ℰff' = I. These conditions are
similar to some used in econometric models. The coefficients of the αth
column are identified except for multiplication by a scale factor if (a) there
are at least m − 1 zero elements in that column and if (b) the rank of Λ^(α) is
m − 1, where Λ^(α) is the matrix composed of the rows containing the
assigned 0's in the αth column with those assigned 0's deleted (i.e., the αth
column deleted). (See Problem 14.1.) The multiplication of a column by a
scale constant can be eliminated by a normalization, such as φ_αα = 1 or
λ_iα = 1 for some i for each α. If φ_αα = 1, α = 1, ..., m, then Φ is a
correlation matrix.

It will be seen that there are m normalizations and a minimum of
m(m − 1) zero conditions. This is equal to the number of elements of C. If
there are more than m − 1 zero elements specified in one or more columns
of Λ, then there may be more conditions than are required to take out the
indeterminacy in ΛC; in this case the conditions may restrict ΛΦΛ'.

As an example, consider the model

(6)  $X = \mu + \begin{pmatrix} 1 & 0 \\ \lambda_{21} & 0 \\ \lambda_{31} & \lambda_{32} \\ 0 & \lambda_{42} \\ 0 & 1 \end{pmatrix}\begin{pmatrix} v \\ a \end{pmatrix} + U = \mu + \begin{pmatrix} v \\ \lambda_{21} v \\ \lambda_{31} v + \lambda_{32} a \\ \lambda_{42} a \\ a \end{pmatrix} + U$

for the scores on five tests, where v and a are measures of verbal and
arithmetic ability. The first two tests are specified to depend only on verbal
ability while the last two tests depend only on arithmetic ability. The
normalizations put verbal ability into the scale of the first test and arithmetic
ability into the scale of the fifth test.


Koopmans and Reiersøl (1950), Anderson and Rubin (1956), and Howe
(1955) suggested the use of preassigned 0's for identification and developed
maximum likelihood estimation under normality for this case. [See also
Lawley (1958).] Jöreskog (1969) called factor analysis under these identifica-
tion conditions confirmatory factor analysis; with arbitrary conditions or with
rotation to simple structure, it has been called exploratory factor analysis.

Other Conditions

A convenient set of conditions is to require the upper square submatrix of Λ
to be the identity. This assumes that the upper square matrix without this
condition is nonsingular. In fact, if Λ* = (Λ*_1', Λ*_2')' is an arbitrary p × m
matrix with Λ*_1 square and nonsingular, then Λ = Λ*Λ*_1⁻¹ = (I_m, Λ_2')' satis-
fies the condition. (This specification of the leading m × m submatrix of Λ
as I_m is a convenient identification condition and does not imply any
substantive meaning.)
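This normalization amounts to the transformation C = Λ*_1⁻¹ of Section 14.2.1, with Φ transformed accordingly. A minimal sketch follows, assuming NumPy; the function name `normalize_leading_block` and the numerical values are hypothetical.

```python
import numpy as np

def normalize_leading_block(Lam_star, Phi_star):
    """Transform (Lam*, Phi*) so that the leading m x m block of Lam is I_m."""
    m = Lam_star.shape[1]
    Lam1 = Lam_star[:m, :]                    # leading square block, assumed nonsingular
    Lam = Lam_star @ np.linalg.inv(Lam1)      # Lam = Lam* C with C = Lam1^{-1}
    Phi = Lam1 @ Phi_star @ Lam1.T            # Phi = C^{-1} Phi* C'^{-1}
    return Lam, Phi

Lam_star = np.array([[1.0, 0.5],
                     [0.2, 2.0],
                     [0.7, 0.3],
                     [0.4, 0.9]])
Phi_star = np.eye(2)
Lam, Phi = normalize_leading_block(Lam_star, Phi_star)
```

After the transformation Φ is in general no longer the identity, which is consistent with the remark that the condition is a matter of convenience only.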


14.2.3. Units of Measurement 


We have considered factor analysis methods applied to covariance matrices. 
In many cases the unit of measurement of each component of X is arbitrary. 
For instance, in psychological tests the unit of scoring has no intrinsic 
meaning. 

Changing the units of measurement means multiplying each component of
X by a constant; these constants are not necessarily equal. When a given test
score is multiplied by a constant, the factor loadings for the test are
multiplied by the same constant and the error variance is multiplied by
the square of the constant. Suppose DX = X*, where D is a diagonal matrix with
positive diagonal elements. Then (1) becomes

(7)  X* = Λ*f + U* + μ*,

where μ* = ℰX* = Dμ, Λ* = DΛ, and U* = DU has covariance matrix
Ψ* = DΨD. Then

(8)  ℰ(X* − μ*)(X* − μ*)' = Λ*ΦΛ*' + Ψ* = Σ*,

where Σ* = DΣD. Note that if the identification conditions are Φ = I and
Λ'Ψ⁻¹Λ diagonal, then Λ* satisfies the latter condition. If Λ is identified
by specified 0's and the normalization is by φ_αα = 1, α = 1, ..., m (i.e., Φ is a
correlation matrix), then Λ* = DΛ is similarly identified. (If the normaliza-
tion is λ_iα = 1 for specified i for each α, each column of DΛ has to be
renormalized.)

A particular diagonal matrix D consists of the reciprocals of the observ-
able standard deviations d_ii = 1/√σ_ii. Then Σ* = DΣD is the correlation
matrix.

We shall see later that the maximum likelihood estimators with identifica-
tion conditions Γ diagonal or specified 0's transform in the above fashion;
that is, the transformation x*_α = Dx_α, α = 1, ..., N, induces Λ̂* = DΛ̂ and
Ψ̂* = DΨ̂D.
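The effect of a change of units on the structure can be verified numerically. The sketch below assumes NumPy, orthogonal factors (Φ = I), and arbitrary hypothetical parameter values; it rescales to standard deviation 1 and checks that the rescaled structure reproduces the correlation matrix and that Λ'Ψ⁻¹Λ is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 4, 2
Lam = rng.standard_normal((p, m))             # loadings (hypothetical)
psi = np.array([0.5, 1.0, 1.5, 2.0])          # error variances (hypothetical)
Sigma = Lam @ Lam.T + np.diag(psi)            # covariance structure (4)

d = 1.0 / np.sqrt(np.diag(Sigma))             # reciprocals of the standard deviations
D = np.diag(d)
Lam_star = D @ Lam                            # loadings scale by d_ii
Psi_star = D @ np.diag(psi) @ D               # error variances scale by d_ii^2
Sigma_star = D @ Sigma @ D                    # correlation matrix

Gamma = Lam.T @ np.diag(1.0 / psi) @ Lam
Gamma_star = Lam_star.T @ np.linalg.inv(Psi_star) @ Lam_star
```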




14.3. MAXIMUM LIKELIHOOD ESTIMATORS FOR RANDOM 
ORTHOGONAL FACTORS. 


14.3.1. Maximum Likelihood Estimators 


In this section we find the maximum likelihood estimators of the parameters
when the observations are normally distributed, that is, the factor scores and
errors are normal [Lawley (1940)]. Then Σ = ΛΦΛ' + Ψ. We impose condi-
tions on Λ and Φ to make them just identified. These do not restrict
ΛΦΛ'; it is a positive semidefinite matrix of rank m. For convenience we
suppose that Φ = I (i.e., the factors are orthogonal or uncorrelated) and that
Γ = Λ'Ψ⁻¹Λ is diagonal. Then the likelihood depends on the mean μ and
Σ = ΛΛ' + Ψ. The maximum likelihood estimators of Λ and Φ under some
other conditions effecting just identification [e.g., Λ = (I_m, Λ_2')'] are trans-
formations of the maximum likelihood estimators of Λ under the preceding
conditions. If x_1, ..., x_N are a set of N observations on X, the likelihood
function for this sample is

(1)  $L = (2\pi)^{-pN/2}\,|\Sigma|^{-N/2}\exp\Big[-\tfrac12\sum_{\alpha=1}^{N}(x_\alpha-\mu)'\Sigma^{-1}(x_\alpha-\mu)\Big].$


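The maximum likelihood estimator of the mean μ is μ̂ = x̄ = (1/N) Σ_{α=1}^N x_α.
Let

(2)  $A = \sum_{\alpha=1}^{N}(x_\alpha-\bar{x})(x_\alpha-\bar{x})'.$

Next we shall maximize the logarithm of (1) with μ replaced by μ̂; this is†

(3)  $-\tfrac12 pN\log 2\pi - \tfrac12 N\log|\Sigma| - \tfrac12\operatorname{tr} A\Sigma^{-1}.$

(This is the logarithm of the concentrated likelihood.) From ΣΣ⁻¹ = I, we
obtain for any parameter θ

(4)  $\dfrac{\partial\Sigma^{-1}}{\partial\theta} = -\Sigma^{-1}\,\dfrac{\partial\Sigma}{\partial\theta}\,\Sigma^{-1}.$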


Then the partial derivative of (3) with regard to ψ_ii, a diagonal element of
Ψ, is −N/2 times

(5)  $\sigma^{ii} - \sum_{k,j=1}^{p} c_{kj}\,\sigma^{ji}\sigma^{ik},$

where Σ⁻¹ = (σ^{ij}) and (c_ij) = C = (1/N)A. In matrix notation, (5) set equal
to 0 yields

(6)  $\operatorname{diag}\Sigma^{-1} = \operatorname{diag}\Sigma^{-1}C\Sigma^{-1},$

where diag H indicates the diagonal terms of the matrix H. Equivalently
diag Σ⁻¹(Σ − C)Σ⁻¹ = diag 0. The derivative of (3) with respect to λ_kτ is −N
times

(7)  $\sum_{j=1}^{p}\sigma^{kj}\lambda_{j\tau} - \sum_{h,g,j=1}^{p}\sigma^{kh}c_{hg}\sigma^{gj}\lambda_{j\tau}, \qquad k = 1,\ldots,p,\quad \tau = 1,\ldots,m.$

In matrix notation (7) set equal to 0 yields

(8)  $\Sigma^{-1}\Lambda = \Sigma^{-1}C\Sigma^{-1}\Lambda.$

†We could add the restriction that the off-diagonal elements of Λ'Ψ⁻¹Λ are 0 with Lagrange
multipliers, but then the Lagrange multipliers become 0 when the derivatives are set equal to 0.
Such restrictions do not affect the maximum.

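We have

(9)  $\Sigma\Psi^{-1}\Lambda = (\Lambda\Lambda' + \Psi)\Psi^{-1}\Lambda = \Lambda\Gamma + \Lambda = \Lambda(\Gamma + I).$

From this we obtain Ψ⁻¹Λ(Γ + I)⁻¹ = Σ⁻¹Λ. Multiply (8) by Σ and use the
above to obtain

(10)  $\Lambda(\Gamma + I) = C\Psi^{-1}\Lambda,$

or

(11)  $(C - \Psi)\Psi^{-1}\Lambda = \Lambda\Gamma.$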


Next we want to show that Σ⁻¹ − Σ⁻¹CΣ⁻¹ = Σ⁻¹(Σ − C)Σ⁻¹ is
Ψ⁻¹(Σ − C)Ψ⁻¹ when (8) holds. Multiply the latter by Σ on the left and on
the right to obtain

(12)  $\Sigma\Psi^{-1}(\Sigma - C)\Psi^{-1}\Sigma = (\Lambda\Lambda' + \Psi)\Psi^{-1}(\Psi + \Lambda\Lambda' - C)\Psi^{-1}(\Lambda\Lambda' + \Psi) = \Psi + \Lambda\Lambda' - C$

because

(13)  $\Lambda\Lambda'\Psi^{-1}(\Psi + \Lambda\Lambda' - C) = \Lambda\Lambda' + \Lambda\Gamma\Lambda' - \Lambda\Lambda'\Psi^{-1}C = \Lambda[(I + \Gamma)\Lambda' - \Lambda'\Psi^{-1}C] = 0$

by virtue of (10). Thus

(14)  $\Sigma^{-1}(\Sigma - C)\Sigma^{-1} = \Psi^{-1}(\Sigma - C)\Psi^{-1}.$

Then (6) is equivalent to diag Ψ⁻¹(Σ − C)Ψ⁻¹ = diag 0. Since Ψ is diago-
nal, this equation is equivalent to

(15)  $\operatorname{diag}(\Lambda\Lambda' + \Psi) = \operatorname{diag} C.$

The estimators Λ̂ and Ψ̂ are determined by (10), (15), and the requirement
that Λ'Ψ⁻¹Λ is diagonal.
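We can multiply (11) on the left by Ψ^{−1/2} to obtain

(16)  $[\Psi^{-1/2}(C - \Psi)\Psi^{-1/2}](\Psi^{-1/2}\Lambda) = (\Psi^{-1/2}\Lambda)\Gamma,$

which shows that the columns of Ψ^{−1/2}Λ are characteristic vectors of
Ψ^{−1/2}(C − Ψ)Ψ^{−1/2} = Ψ^{−1/2}CΨ^{−1/2} − I and the corresponding diagonal ele-
ments of Γ are the characteristic roots. [In fact, the characteristic vectors of
Ψ^{−1/2}CΨ^{−1/2} − I are the characteristic vectors of Ψ^{−1/2}CΨ^{−1/2} because
(Ψ^{−1/2}CΨ^{−1/2} − I)x = γx is equivalent to Ψ^{−1/2}CΨ^{−1/2}x = (1 + γ)x.] The vec-
tors are normalized by (Ψ^{−1/2}Λ)'(Ψ^{−1/2}Λ) = Λ'Ψ⁻¹Λ = Γ. The characteristic
roots are chosen to maximize the likelihood. To evaluate the maximized
likelihood function we calculate

(17)  $\operatorname{tr} C\hat\Sigma^{-1} = \operatorname{tr} C\hat\Sigma^{-1}(\hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Psi^{-1} = \operatorname{tr}[C\hat\Psi^{-1} - (C\hat\Sigma^{-1}\hat\Lambda)\hat\Lambda'\hat\Psi^{-1}] = \operatorname{tr}[C\hat\Psi^{-1} - \hat\Lambda\hat\Lambda'\hat\Psi^{-1}] = \operatorname{tr}[(\hat\Lambda\hat\Lambda' + \hat\Psi)\hat\Psi^{-1} - \hat\Lambda\hat\Lambda'\hat\Psi^{-1}] = p.$

The third equality follows from (8) multiplied on the left by Σ̂; the fourth
equality follows from (15) and the fact that Ψ̂ is diagonal. Next we find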


(18)  $|\hat\Sigma| = |\hat\Psi|\cdot|\hat\Psi^{-1/2}\hat\Lambda\hat\Lambda'\hat\Psi^{-1/2} + I_p| = |\hat\Psi|\cdot|\hat\Lambda'\hat\Psi^{-1}\hat\Lambda + I_m| = |\hat\Psi|\cdot|\hat\Gamma + I_m|.$

The second equality is |UU' + I_p| = |U'U + I_m| for U p × m, which is proved
as in (14) of Section 8.4. From the fact that the characteristic roots of
Ψ̂^{−1/2}(C − Ψ̂)Ψ̂^{−1/2} are the roots γ̂_1 > γ̂_2 > ··· > γ̂_p of
0 = |C − Ψ̂ − γΨ̂| = |C − (1 + γ)Ψ̂|,

(19)  $\dfrac{|C|}{|\hat\Psi|} = \prod_{j=1}^{p}(1 + \hat\gamma_j).$

[Note that the roots 1 + γ̂_j of Ψ̂^{−1/2}CΨ̂^{−1/2} are positive. The roots γ̂_j of
Ψ̂^{−1/2}(C − Ψ̂)Ψ̂^{−1/2} are not necessarily positive; usually some will be negative.]
Then

(20)  $|\hat\Sigma| = |C|\,\dfrac{\prod_{j\in S}(1+\hat\gamma_j)}{\prod_{j=1}^{p}(1+\hat\gamma_j)} = \dfrac{|C|}{\prod_{j\notin S}(1+\hat\gamma_j)},$

where S is the set of indices corresponding to the roots in Γ̂. The logarithm
of the maximized likelihood function is

(21)  $-\tfrac12 pN\log 2\pi - \tfrac12 N\log|C| + \tfrac12 N\sum_{j\notin S}\log(1+\hat\gamma_j) - \tfrac12 Np.$

The largest roots γ̂_1 > ··· > γ̂_m should be selected for diagonal elements of
Γ̂. Then S = {1, ..., m}. The logarithm of the concentrated likelihood (3) is a
function of Σ = ΛΛ' + Ψ. This matrix is positive definite for every Λ and
every diagonal Ψ that is positive definite; it is also positive definite for some
diagonal Ψ's that are not positive definite. Hence there is not necessarily a
relative maximum for Ψ positive definite. The concentrated likelihood
function may increase as one or more diagonal elements of Ψ approaches 0.
In that case the derivative equations may not be satisfied for Ψ positive
definite.
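Equation (16) suggests how Λ can be obtained once Ψ is given: take the characteristic vectors of Ψ^{−1/2}(C − Ψ)Ψ^{−1/2} belonging to the m largest roots and scale them so that Λ'Ψ⁻¹Λ = Γ. The following is a minimal sketch of that single step, not a complete estimation procedure; it assumes NumPy, the function name `loadings_given_psi` is hypothetical, and the demonstration uses a matrix C that satisfies the model exactly.

```python
import numpy as np

def loadings_given_psi(C, psi, m):
    """For a given diagonal Psi, return Lam solving (11) with the m largest roots."""
    s = 1.0 / np.sqrt(psi)
    # Psi^{-1/2} (C - Psi) Psi^{-1/2} = Psi^{-1/2} C Psi^{-1/2} - I
    M = (C * s[:, None]) * s[None, :] - np.eye(len(psi))
    gamma, Q = np.linalg.eigh(M)
    gamma, Q = gamma[::-1], Q[:, ::-1]            # largest roots first
    if np.any(gamma[:m] <= 0):
        raise ValueError("the m largest roots must be positive")
    # Psi^{-1/2} Lam = Q_m Gamma^{1/2}, so Lam = Psi^{1/2} Q_m Gamma^{1/2}
    Lam = (Q[:, :m] * np.sqrt(gamma[:m])) / s[:, None]
    return Lam, gamma

rng = np.random.default_rng(5)
Lam0 = rng.standard_normal((5, 2))                # hypothetical true loadings
psi0 = np.array([0.5, 1.0, 1.5, 1.0, 0.7])        # hypothetical error variances
C = Lam0 @ Lam0.T + np.diag(psi0)                 # a C that fits the model exactly
Lam, gamma = loadings_given_psi(C, psi0, 2)
```

In this exact-fit case the remaining p − m roots are 0, which is the situation the test criterion of Section 14.3.2 measures departures from.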

The equations for the estimators (11) and (15) can be written as polyno-
mial equations [multiplying (11) by |Ψ|], but cannot be solved directly. There
are various iterative procedures for finding a maximum of the likelihood 
function, including steepest descent, Newton—Raphson, scoring (using the 
information matrix), and Fletcher-Powell. [See Lawley and Maxwell (1971), 
Appendix II, for a discussion.] 

Since there may not be a relative maximum in the region for which ψ_ii > 0,
i = 1, ..., p, an iterative procedure may define a sequence of values of Λ̂ and
Ψ̂ that includes ψ̂_ii < 0 for some indices i. Such negative values are inadmis-
sible because ψ_ii is interpreted as the variance of an error. One may impose
the condition that ψ_ii ≥ 0, i = 1, ..., p. Then the maximum may occur on the
boundary (and not all of the derivative equations will be satisfied). For some
indices i the estimated variance of the error is 0; that is, some test scores are
exactly linear combinations of factor scores. If the identification conditions
Φ = I and Λ'Ψ⁻¹Λ diagonal are dropped, we can find a coordinate system
for the factors such that the test scores with 0 error variance can be
interpreted as (transformed) factor scores. That interpretation does not seem
useful. [See Lawley and Maxwell (1971) for further discussion.]

An alternative to requiring ψ_ii to be positive is to require ψ_ii to be
bounded away from 0. A possibility is ψ_ii ≥ εσ_ii for some small ε, such as
0.005. Of course, the value of ε is arbitrary; increasing ε will decrease the
value of the maximum if the maximum is not in the interior of the restricted
region, and the derivative equations will not all be satisfied.

The nature of the concentrated likelihood is such that more than one 
relative maximum may be possible. Which maximum an iterative procedure 
approaches will depend on the initial values. Rubin and Thayer (1982) have 
given an example of three sets of estimates from three different initial 
estimates using the EM algorithm. 

The EM (expectation–maximization) algorithm is a possible computational
device for maximum likelihood estimation [Dempster, Laird, and Rubin
(1977), Rubin and Thayer (1982)]. The idea is to treat the unobservable f's as
missing data. Under the assumption that f and U have a joint normal
distribution, the sufficient statistics are the means and covariances of the X's
and f's. The E-step of the algorithm is to obtain the expectation of the
covariances on the basis of trial values of the parameters. The M-step is to
maximize the likelihood function on the basis of these covariances; this step
provides updated values of the parameters. The steps alternate, and the
procedure usually converges to the maximum likelihood estimators. (See
Problem 14.3.)
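The E- and M-steps can be sketched in a few lines for orthogonal factors (Φ = I). The code below is an illustrative sketch in the spirit of the updates described by Rubin and Thayer (1982), not a reproduction of their program; it assumes NumPy, and the function names, the starting values, the number of iterations, and the simulated data are all hypothetical choices. It does not impose the condition that Λ'Ψ⁻¹Λ be diagonal, so the loadings it returns are determined only up to rotation.

```python
import numpy as np

def concentrated_loglik(C, Lam, psi):
    """-(1/2)[log|Sigma| + tr C Sigma^{-1}]: (3) divided by N, constant omitted."""
    Sigma = Lam @ Lam.T + np.diag(psi)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + np.trace(np.linalg.solve(Sigma, C)))

def em_factor_analysis(C, m, n_iter=200):
    """EM iterations for Sigma = Lam Lam' + Psi given the sample covariance C."""
    vals, vecs = np.linalg.eigh(C)
    Lam = vecs[:, -m:] * np.sqrt(0.5 * vals[-m:])   # crude starting loadings
    psi = 0.5 * np.diag(C).copy()                   # crude starting error variances
    history = []
    for _ in range(n_iter):
        # E-step: expected covariances involving the unobserved factors.
        Sigma = Lam @ Lam.T + np.diag(psi)
        delta = np.linalg.solve(Sigma, Lam)         # Sigma^{-1} Lam
        Delta = np.eye(m) - Lam.T @ delta           # conditional covariance of f given x
        C_xf = C @ delta                            # expected cross-covariance of x and f
        C_ff = delta.T @ C_xf + Delta               # expected covariance of f
        # M-step: regression of x on f with the expected covariances.
        Lam = C_xf @ np.linalg.inv(C_ff)
        psi = np.diag(C - Lam @ C_xf.T).copy()
        history.append(concentrated_loglik(C, Lam, psi))
    return Lam, psi, history

# Simulated data under hypothetical parameter values.
rng = np.random.default_rng(6)
N, p, m = 500, 6, 2
Lam_true = rng.standard_normal((p, m))
psi_true = rng.uniform(0.5, 1.5, p)
scores = rng.standard_normal((N, m))
errors = rng.standard_normal((N, p)) * np.sqrt(psi_true)
data = scores @ Lam_true.T + errors
C = np.cov(data, rowvar=False, bias=True)           # C = (1/N) A
Lam_hat, psi_hat, history = em_factor_analysis(C, m)
```

Each iteration cannot decrease the concentrated log likelihood, which gives a simple check on an implementation; as noted above, the limit reached may depend on the starting values.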

As noted in Section 14.2, the structure is equivariant and the factor scores
are invariant under changes in the units of measurement of the observed
variables X → DX, where D is a diagonal matrix with positive diagonal
elements and Λ is identified by Λ'Ψ⁻¹Λ is diagonal. If we let DΛ = Λ*,
DΨD = Ψ*, and DCD = C*, then the logarithm of the likelihood function is
a constant plus a constant times

(22)  $-\log|\Psi^* + \Lambda^*\Lambda^{*\prime}| - \operatorname{tr} C^*(\Psi^* + \Lambda^*\Lambda^{*\prime})^{-1} = -\log|\Psi + \Lambda\Lambda'| - \operatorname{tr} C(\Psi + \Lambda\Lambda')^{-1} - 2\log|D|.$

The maximum likelihood estimators of Λ* and Ψ* are Λ̂* = DΛ̂ and
Ψ̂* = DΨ̂D, and Λ̂*'Ψ̂*⁻¹Λ̂* = Λ̂'Ψ̂⁻¹Λ̂ is diagonal. That is, the estimated
factor loadings and error variances are merely changed by the units of
measurement.

It is often convenient to use d_ii = 1/√c_ii, so DCD = (r_ij) is made up of
the sample correlation coefficients. The analysis is independent of the units
of measurement. This fact is related to the fact that psychological test scores
do not have natural units.

The fact that the factors do not depend on the location and scale factors is
one reason for considering factor analysis as an analysis of interdependence.
It is convenient to give some rules of thumb for initial estimates of the
communalities, Σ_{j=1}^m λ̂_ij² = 1 − ψ̂_ii, in terms of observed correlations. One rule
is to use R²_{i·1,...,i−1,i+1,...,p}. Another is to use max_{h≠i} |r_ih|.


14.3.2. Test of the Hypothesis That the Model Fits


We shall derive the likelihood ratio test that the model fits; that is, that for a
specified m the covariance matrix can be written as Σ = Ψ + ΛΛ' for some
diagonal positive definite Ψ and some p × m matrix Λ. The likelihood ratio
criterion is

(23)  $\dfrac{\max_{\mu,\Lambda,\Psi} L(\mu, \Psi + \Lambda\Lambda')}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \dfrac{|C|^{N/2}}{|\hat\Psi + \hat\Lambda\hat\Lambda'|^{N/2}} = \prod_{j=m+1}^{p}(1+\hat\gamma_j)^{N/2}$

because the unrestricted maximum likelihood estimator of Σ is C,
tr C(Ψ̂ + Λ̂Λ̂')⁻¹ = p by (17), and |C|/|Σ̂| = ∏_{j=m+1}^p (1 + γ̂_j) from (20). The null
hypothesis is rejected if (23) is too small. We can use −2 times the logarithm
of the likelihood ratio criterion:

(24)  $-N\sum_{j=m+1}^{p}\log(1+\hat\gamma_j)$

and reject the null hypothesis if (24) is too large.

If the regularity conditions for Ψ̂ and Λ̂ to be asymptotically normally
distributed hold, the limiting distribution of (24) under the null hypothesis is
χ² with degrees of freedom ½[(p − m)² − p − m], which is the number of
elements of Σ plus the number of identifying restrictions minus the number
of parameters in Ψ and Λ. Bartlett (1950) suggested replacing N by‡
N − (2p + 11)/6 − 2m/3. See also Amemiya and Anderson (1990).
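Given the roots γ̂_1, ..., γ̂_p from a fitted model, the criterion (24), Bartlett's modification, and the degrees of freedom are simple to compute. The sketch below uses only the Python standard library; the function name `fit_test_statistic` and the numerical roots are hypothetical, and the comparison with a χ² table is left to the reader.

```python
import math

def fit_test_statistic(gamma, m, N, bartlett=True):
    """Return (statistic, degrees of freedom) for the test that m factors fit."""
    p = len(gamma)
    roots = sorted(gamma, reverse=True)          # gamma_1 >= ... >= gamma_p
    # Bartlett's suggestion replaces N by N - (2p + 11)/6 - 2m/3.
    multiplier = N - (2 * p + 11) / 6 - 2 * m / 3 if bartlett else N
    statistic = -multiplier * sum(math.log1p(g) for g in roots[m:])
    df = ((p - m) ** 2 - p - m) // 2
    return statistic, df

# Hypothetical roots for p = 5 and m = 2; the last p - m roots sum to 0.
gamma_hat = [3.0, 1.0, 0.1, -0.05, -0.05]
stat_plain, df = fit_test_statistic(gamma_hat, m=2, N=100, bartlett=False)
stat_bartlett, _ = fit_test_statistic(gamma_hat, m=2, N=100)
```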

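From (15) and the fact that γ̂_1, ..., γ̂_p are the characteristic roots of
Ψ̂^{−1/2}(C − Ψ̂)Ψ̂^{−1/2} we have

(25)  $0 = \operatorname{tr}\hat\Psi^{-1/2}(C - \hat\Psi - \hat\Lambda\hat\Lambda')\hat\Psi^{-1/2} = \operatorname{tr}\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2} - \operatorname{tr}\hat\Gamma = \sum_{i=1}^{p}\hat\gamma_i - \sum_{i=1}^{m}\hat\gamma_i = \sum_{i=m+1}^{p}\hat\gamma_i.$

‡This factor is heuristic. If m = 0, the factor from Chapter 9 is N − (2p + 11)/6; Bartlett
suggested replacing N and p by N − m and p − m, respectively.

If |γ̂_j| < 1 for j = m + 1, ..., p, we can expand (24) using (25) as

(26)  $-N\sum_{j=m+1}^{p}\big(\hat\gamma_j - \tfrac12\hat\gamma_j^2 + \tfrac13\hat\gamma_j^3 - \cdots\big) = \tfrac12 N\sum_{j=m+1}^{p}\big(\hat\gamma_j^2 - \tfrac23\hat\gamma_j^3 + \cdots\big).$

The criterion is approximately ½N Σ_{j=m+1}^p γ̂_j². The estimators Ψ̂ and Λ̂ are
found so that C − Ψ̂ − Λ̂Λ̂' is small in a statistical sense or, equivalently,
so C − Ψ̂ is approximately of rank m. Then the smallest p − m roots of
Ψ̂^{−1/2}(C − Ψ̂)Ψ̂^{−1/2} should be near 0. The criterion measures the deviations
of these roots from 0. Since γ̂_{m+1}, ..., γ̂_p are the nonzero roots of
Ψ̂^{−1/2}(C − Σ̂)Ψ̂^{−1/2}, we see that

(27)  $\tfrac12\sum_{j=m+1}^{p}\hat\gamma_j^2 = \tfrac12\operatorname{tr}\hat\Psi^{-1}(C - \hat\Sigma)\hat\Psi^{-1}(C - \hat\Sigma) = \sum_{i<j}\dfrac{(c_{ij} - \hat\sigma_{ij})^2}{\hat\psi_{ii}\hat\psi_{jj}}$

because the diagonal elements of C − Σ̂ are 0.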

In many situations the investigator does not know a value of m to
hypothesize. He or she wants to determine the smallest number of factors
such that the model is consistent with the data. It is customary to test
successive values of m. The investigator starts with a test that the number of
factors is a specified $m_0$ (possibly 0 or 1). If that hypothesis is rejected, one
proceeds to test that the number is $m_0 + 1$. One continues in that fashion
until a hypothesis is accepted or until $\frac{1}{2}[(p - m)^2 - p - m] \le 0$. In the latter
event one concludes that no nontrivial factor model fits. Unfortunately,
the probabilities of errors under this procedure are unknown, even
asymptotically.
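
The arithmetic of the test can be sketched in a few lines. The following is only an illustration, assuming the roots $\hat\gamma_{m+1}, \ldots, \hat\gamma_p$ have already been obtained from a fitted model; the function names and the numerical roots are hypothetical, and the comparison with a $\chi^2$ quantile is left to the reader.

```python
import numpy as np

def factor_fit_statistic(gamma_tail, N, p, m, bartlett=True):
    """Return (statistic, degrees_of_freedom) for the test that m factors fit.

    gamma_tail holds the p - m smallest roots gamma_hat_{m+1}, ..., gamma_hat_p.
    """
    gamma_tail = np.asarray(gamma_tail, dtype=float)
    if gamma_tail.shape != (p - m,):
        raise ValueError("expected p - m roots")
    # Bartlett's heuristic replacement for N: N - (2p + 11)/6 - 2m/3.
    factor = N - (2 * p + 11) / 6 - 2 * m / 3 if bartlett else N
    statistic = -factor * np.sum(np.log1p(gamma_tail))   # criterion (24)
    dof = ((p - m) ** 2 - p - m) / 2
    return statistic, dof

def quadratic_approximation(gamma_tail, N):
    """Leading term of the expansion (26): (N/2) times the sum of squared roots."""
    return 0.5 * N * np.sum(np.square(gamma_tail))

# Hypothetical roots for p = 6, m = 2; they sum to 0, as (25) requires.
roots = np.array([0.10, -0.05, 0.02, -0.07])
stat, dof = factor_fit_statistic(roots, N=200, p=6, m=2, bartlett=False)
approx = quadratic_approximation(roots, N=200)
```

In the sequential procedure one would call such a function for $m = m_0, m_0 + 1, \ldots$ and stop when the hypothesis is accepted or the degrees of freedom are no longer positive.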


14.3.3. Asymptotic Distributions of the Estimators 


The maximum likelihood estimators $\hat\Lambda$ and $\hat\Psi$ maximize the average concentrated
log likelihood function $L^*(C, \Lambda^*, \Psi^*)$ given by (3) divided by N for
$\Sigma^* = \Psi^* + \Lambda^*\Lambda^{*\prime}$, subject to $\Lambda^{*\prime}\Psi^{*-1}\Lambda^*$ being diagonal. If C is a consistent
estimator of $\Sigma$ (the "true" covariance matrix), then $L^*(C, \Lambda^*, \Psi^*) \to
L^*(\Psi + \Lambda\Lambda', \Lambda^*, \Psi^*)$ uniformly in probability in a neighborhood of $\Lambda$, $\Psi$,
and $L^*(\Psi + \Lambda\Lambda', \Lambda^*, \Psi^*)$ has a unique maximum at $\Psi^* = \Psi$ and $\Lambda^* = \Lambda$.
Because the function is continuous, the $\Lambda^*$, $\Psi^*$ that maximize
$L^*(C, \Lambda^*, \Psi^*)$ must converge stochastically to $\Lambda$, $\Psi$.


Theorem 14.3.1. If $\Lambda$ and $\Psi$ are identified by $\Lambda'\Psi^{-1}\Lambda$ being diagonal, if
the diagonal elements are different and ordered, and if $C \xrightarrow{p} \Psi + \Lambda\Lambda'$, then
$\hat\Psi \xrightarrow{p} \Psi$ and $\hat\Lambda \xrightarrow{p} \Lambda$.


A sufficient condition for $C \xrightarrow{p} \Sigma$ is that $(f' \; U')'$ has a distribution with
finite second-order moments.

The estimators $\hat\Lambda$ and $\hat\Psi$ are the solutions to the equations (10), (15), and
the requirement that $\hat\Lambda'\hat\Psi^{-1}\hat\Lambda$ is diagonal. These equations are polynomial
equations. The derivatives of $\hat\Lambda$ and $\hat\Psi$ as functions of C are continuous
unless they become infinite. Anderson and Rubin (1956) investigated conditions
for the derivative to be finite and proved the following theorem:


Theorem 14.3.2. Let

(28)    $(\theta_{ij}) = \Theta = \Psi - \Lambda(\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'.$

If $(\theta_{ij}^2)$ is nonsingular, if $\Lambda$ and $\Psi$ are identified by the condition that $\Lambda'\Psi^{-1}\Lambda$
is diagonal and the diagonal elements are different and ordered, if $C \xrightarrow{p} \Psi + \Lambda\Lambda'$,
and if $\sqrt{N}(C - \Sigma)$ has a limiting normal distribution, then $\sqrt{N}(\hat\Lambda - \Lambda)$ and
$\sqrt{N}(\hat\Psi - \Psi)$ have a limiting normal distribution.


For example, $\sqrt{N}(C - \Sigma)$ will have a limiting distribution if $(f' \; U')'$ has a
distribution with finite fourth moments.

The covariance matrix of the limiting distribution of $\sqrt{N}(\hat\Lambda - \Lambda)$ and
$\sqrt{N}(\hat\Psi - \Psi)$ is too complicated to derive or even present here. Lawley (1953)
found covariances for $\sqrt{N}(\hat\Lambda - \Lambda)$ appropriate for $\Psi$ known, and Lawley
(1967) extended his work to the case of $\Psi$ estimated. [See also Lawley and
Maxwell (1971).] Jennrich and Thayer (1973) corrected an error in his work.
The covariance of $\sqrt{N}(\hat\psi_{ii} - \psi_{ii})$ and $\sqrt{N}(\hat\psi_{jj} - \psi_{jj})$ in the limiting
distribution is

(29)    $2\psi_{ii}^2 \psi_{jj}^2 \xi^{ij}, \qquad i, j = 1, \ldots, p,$

where $(\xi^{ij}) = (\theta_{ij}^2)^{-1}$. The other covariances are too involved to give here.

While the asymptotic covariances are too complicated to give insight into 
the sampling variability, they can be programmed for computation. In that 
case the parameters are replaced by their consistent estimators. 


14.3.4. Minimum-Distance Methods 


An alternative to maximum likelihood is generalized least squares. The
estimators are the values of $\Psi$ and $\Lambda$ that minimize

(30)    $\operatorname{tr}(C - \Sigma)H(C - \Sigma)H,$

where $\Sigma = \Psi + \Lambda\Lambda'$ and $H = \Sigma^{-1}$ or some consistent estimator of $\Sigma^{-1}$.
When $H = \Sigma^{-1}$, the objective function is of the form

(31)    $[c - \sigma(\Psi, \Lambda)]' [\operatorname{cov} c]^{-1} [c - \sigma(\Psi, \Lambda)],$

where c represents the elements of C arranged in a vector, $\sigma(\Psi, \Lambda)$ is
$\Psi + \Lambda\Lambda'$ arranged in a corresponding vector, and cov c is the covariance
matrix of c under normality [Anderson (1973a)]. Jöreskog and Goldberger
(1972) use $C^{-1}$ for H and minimize

(32)    $\operatorname{tr}(C - \Sigma)C^{-1}(C - \Sigma)C^{-1} = \operatorname{tr}(I - \Sigma C^{-1})^2.$

The matrix of derivatives with respect to the elements of $\Lambda$ set equal to 0
forms the matrix equation

(33)    $C^{-1}(C - \hat\Sigma)C^{-1}\hat\Lambda = 0.$

This can be rewritten as

(34)    $\hat\Lambda = \hat\Sigma C^{-1}\hat\Lambda.$

Multiplication on the left by $\hat\Sigma^{-1}C\hat\Sigma^{-1}$ yields (8), which leads to (10). This
estimator of $\Lambda$ given $\Psi$ is the same as the maximum likelihood estimator
except for normalization of columns. The equation obtained by setting the
derivatives of (32) with respect to $\Psi$ equal to 0 is

(35)    $\operatorname{diag} C^{-1}[(\hat\Psi + \hat\Lambda\hat\Lambda') - C]C^{-1} = \operatorname{diag} 0.$

An alternative is to minimize

(36)    $\operatorname{tr}[(C - \Sigma)\Sigma^{-1}]^2 = \operatorname{tr}[C(\Psi + \Lambda\Lambda')^{-1} - I]^2.$

This leads to (8) or (10) and

(37)    $\operatorname{diag} \hat\Sigma^{-1}C\hat\Sigma^{-1}(C - \hat\Sigma)\hat\Sigma^{-1} = \operatorname{diag} 0.$


Browne (1974) showed that the generalized least squares estimator of $\Psi$ has
the same asymptotic distribution as the maximum likelihood estimator. Dahm
and Fuller (1981) showed that if cov c in (31) is replaced by a matrix
converging to cov c and $\Psi$, $\Lambda$, and $\Phi$ depend on some parameters, then the
asymptotic distributions are the same as for maximum likelihood.
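
As a small numerical illustration, the generalized least squares objective can be evaluated directly. The sketch below assumes orthogonal factors ($\Phi = I$); the function name `gls_objective`, the parameter values, and the stand-in sample matrix are hypothetical. It also checks numerically that the two sides of (32) agree.

```python
import numpy as np

def gls_objective(C, Lam, psi, H=None):
    """tr[(C - Sigma) H (C - Sigma) H] with Sigma = diag(psi) + Lam Lam'."""
    Sigma = np.diag(psi) + Lam @ Lam.T
    if H is None:                      # the choice H = C^{-1} used in (32)
        H = np.linalg.inv(C)
    D = (C - Sigma) @ H
    return np.trace(D @ D)

rng = np.random.default_rng(0)
Lam = rng.normal(size=(5, 2))          # hypothetical loadings
psi = rng.uniform(0.5, 1.5, size=5)    # hypothetical error variances
Sigma = np.diag(psi) + Lam @ Lam.T
C = Sigma + 0.05 * np.eye(5)           # stand-in for a sample covariance matrix

value = gls_objective(C, Lam, psi)
M = np.eye(5) - Sigma @ np.linalg.inv(C)
alt = np.trace(M @ M)                  # right-hand side of (32)
```

In practice the objective would be passed to a numerical optimizer over the free elements of $\Lambda$ and $\Psi$; that step is omitted here.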


14.3.5. Relation to Principal Component Analysis 


What is the relation of maximum likelihood to the principal component 
analysis proposed by Hotelling (1933)? As explained in Chapter 11, the vector 
of sample principal components is the orthogonal transformation B'X, where 
the columns of B are the characteristic vectors of C normalized by $B'B = I$.
Then

(38)    $C = BTB' = \sum_{i=1}^{p} b_i t_i b_i',$

where T is the diagonal matrix with diagonal elements $t_1, \ldots, t_p$, the
characteristic roots of C. If $t_{m+1}, \ldots, t_p$ are small, C can be approximated by

(39)    $B_1 T_1 B_1' = \sum_{i=1}^{m} b_i t_i b_i',$

where $T_1$ is the diagonal matrix with diagonal elements $t_1, \ldots, t_m$, and X is
approximated by

(40)    $B_1 B_1' X = \sum_{i=1}^{m} b_i (b_i' X).$

Then the sample covariance of the difference between X and the approximation
(40) is the sample covariance of

(41)    $X - B_1 B_1' X = B_2 B_2' X,$

which is $B_2 T_2 B_2' = \sum_{i=m+1}^{p} b_i t_i b_i'$, and the sum of the variances of the
components is $\sum_{i=m+1}^{p} t_i$. Here $T_2$ is the diagonal matrix with $t_{m+1}, \ldots, t_p$ as
diagonal elements.
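
The decomposition (38) to (41) can be checked numerically. The following sketch uses simulated data (purely hypothetical) and takes $B_1$ and $B_2$ to be the first m and last p - m columns of B, with the sample covariance computed with divisor N.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # hypothetical data, rows x_alpha'
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)

t, B = np.linalg.eigh(C)              # eigh returns roots in ascending order
order = np.argsort(t)[::-1]
t, B = t[order], B[:, order]          # now t_1 >= ... >= t_p

m = 2
B1, B2 = B[:, :m], B[:, m:]
C_approx = B1 @ np.diag(t[:m]) @ B1.T         # approximation (39)
resid = Xc - Xc @ B1 @ B1.T                   # rows are the transposes of (41)
C_resid = resid.T @ resid / len(X)            # sample covariance of the residual
```

The trace of `C_resid` is the part of tr C left unexplained by the first m components.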

This analysis is in terms of some common unit of measurement. The first 
m components "explain" a large proportion of the “variance,” tr C. When the 
units of measurement are not the same (e.g., when the units are arbitrary), 
it is customary to standardize each measurement to (sample) variance 1. 
However, then the principal components do not have the interpretation in
terms of variance. 

Another difference between principal component analysis and factor anal- 
ysis is that the former does not separate the error from the systematic part. 
This fault is easily remedied, however. Thomson (1934) proposed the following
estimation procedure for the factor analysis model. A diagonal matrix $\Psi$
is subtracted from C, and the principal component analysis is carried out on
$C - \Psi$. However, $\Psi$ is determined so $C - \Psi$ is close to rank m. The
equations are

(42)    $(C - \hat\Psi)\hat\Lambda = \hat\Lambda L,$
(43)    $\operatorname{diag}(\hat\Psi + \hat\Lambda\hat\Lambda') = \operatorname{diag} C,$
(44)    $\hat\Lambda'\hat\Lambda = L$ diagonal.

The last equation is a normalization and takes out the indeterminacy in $\Lambda$.
This method allows for the error terms, but still depends on the units of
measurement. The estimators are consistent but not (asymptotically) efficient
in the usual factor analysis model.
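
One common way to attack equations (42) to (44) is to alternate between an eigen-decomposition of $C - \Psi$ and an update of $\Psi$ from (43). The sketch below is an assumption about how such an iteration might be organized, not a description of Thomson's own computations; the function names are hypothetical and convergence from an arbitrary starting value is not guaranteed.

```python
import numpy as np

def principal_factor_step(C, psi, m):
    """Eigen-decompose C - Psi, keep the m largest roots, refresh Psi from (43)."""
    roots, vecs = np.linalg.eigh(C - np.diag(psi))
    idx = np.argsort(roots)[::-1][:m]
    L = np.clip(roots[idx], 0.0, None)        # guard against negative roots
    Lam = vecs[:, idx] * np.sqrt(L)           # satisfies (42) and Lam' Lam = diag(L)
    psi_new = np.diag(C) - np.sum(Lam**2, axis=1)
    return Lam, psi_new

def principal_factor(C, m, psi0, n_iter=200, tol=1e-10):
    """Repeat the step until the error variances stop changing."""
    psi = np.asarray(psi0, dtype=float)
    Lam = None
    for _ in range(n_iter):
        Lam, psi_new = principal_factor_step(C, psi, m)
        converged = np.max(np.abs(psi_new - psi)) < tol
        psi = psi_new
        if converged:
            break
    return Lam, psi

# Hypothetical exact factor structure: the true Psi is a fixed point of the step.
Lam_true = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.2],
                     [0.1, 0.8], [0.0, 0.9], [0.2, 0.7]])
psi_true = np.array([0.3, 0.4, 0.5, 0.35, 0.25, 0.45])
C = np.diag(psi_true) + Lam_true @ Lam_true.T
Lam_hat, psi_hat = principal_factor(C, m=2, psi0=psi_true)
```

The loadings are recovered only up to the normalization (44), so `Lam_hat` reproduces $\Lambda\Lambda'$ rather than $\Lambda$ itself.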


14.3.6. The Centroid Method 


Before the availability of high-speed computers, the centroid method was 
used almost exclusively because of its computational ease. For the sake of 
history we give a sketch of the method. Let R* be the reduced correlation
matrix, that is, the matrix consisting of $r_{ij}$, $i \ne j$, and $1 - \hat\psi_{ii}$, where $\hat\psi_{ii}$ is an
initial estimate of the error variance in standard deviation units. Thomson's
principal components approach is first to find the m characteristic vectors of
$R_0 = R^*$ corresponding to the m largest characteristic roots. As indicated in
Chapter 11, one computational method involves starting with an initial
estimate of the first vector, say $x^{(0)}$, calculating $x^{(1)} = R_0 x^{(0)}$, and iterating. At
the rth step $x^{(r)} = R_0 x^{(r-1)}$ is approximately $\gamma_1 x^{(r-1)}$, where $\gamma_1$ is the largest
root. Then $y_1 = x^{(r)} / \sqrt{x^{(r-1)\prime} x^{(r)}}$ is approximately
the first characteristic vector normalized so $y_1' y_1 = \gamma_1$. To obtain the second
vector, apply the same procedure to $R_1 = R^* - y_1 y_1'$.

The centroid method can be considered as a very rough approximation to
the principal component approach. With psychological tests the correlation
matrix usually consists of positive entries, and the first characteristic vector
has all positive components, often of about the same value. The centroid
method uses $\varepsilon = (1, \ldots, 1)'$ as the initial estimate of the first vector. Then
$R^*\varepsilon = x^{(1)}$ is the first iterate and should be an approximation to the first
characteristic vector. An approximation to the first characteristic root is
$\varepsilon' R^* \varepsilon / \varepsilon'\varepsilon$. Then $y_1 = x^{(1)} / \sqrt{\varepsilon' R^* \varepsilon}$ is an approximation to the first
characteristic vector of R* normalized to have length squared $\gamma_1$. The operations
can be carried out on an adding machine or on a desk calculator because
$R^*\varepsilon$ amounts to adding across rows and $\varepsilon' R^* \varepsilon$ is the sum of those row
totals.

The second characteristic vector is orthogonal to the first. A vector
orthogonal to $\varepsilon$ is $\varepsilon^*$ consisting of p/2 1's and p/2 -1's. Then $R_1 \varepsilon^* = x^{(2)}$ is
an approximation to the second characteristic vector, and $\varepsilon^{*\prime} R_1 \varepsilon^* / \varepsilon^{*\prime}\varepsilon^*$
approximates the second characteristic root. These operations involve changing
signs of entries of $R_1$ and adding. The positions of the -1's in $\varepsilon^*$ are
selected to maximize $\varepsilon^{*\prime} R_1 \varepsilon^*$. The procedure can be continued.
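
The first centroid step amounts to row sums and one square root, as the following sketch shows. The function name `centroid_factor` and the numerical matrix are hypothetical, and the sketch assumes the sum of all entries (after sign changes) is positive so that the square root exists.

```python
import numpy as np

def centroid_factor(R_star, signs=None):
    """One centroid loading vector from a reduced correlation matrix.

    signs is a vector of +1/-1 entries playing the role of the starting
    vector; it defaults to all ones, as for the first factor.
    """
    p = R_star.shape[0]
    e = np.ones(p) if signs is None else np.asarray(signs, dtype=float)
    x = R_star @ e                 # row sums (after sign changes)
    total = e @ x                  # sum of the row totals
    root_approx = total / (e @ e)  # approximate characteristic root
    y = x / np.sqrt(total)         # loadings with y'y roughly equal to the root
    return y, root_approx

# Hypothetical reduced correlation matrix with every entry equal to 0.49.
R_star = np.full((4, 4), 0.49)
y1, root1 = centroid_factor(R_star)
R1 = R_star - np.outer(y1, y1)     # residual matrix used for the second factor
```

For this artificial matrix the single factor reproduces R* exactly, so the residual matrix is zero; with real data one would pass a sign vector to `centroid_factor` for the next factor.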


14.4. ESTIMATION FOR FIXED FACTORS 


Let $x_\alpha = (x_{1\alpha}, \ldots, x_{p\alpha})'$ be an observation on $X_\alpha$ given by

(1)    $X_\alpha = \Lambda f_\alpha + \mu + U_\alpha,$

with $f_\alpha$ being a nonstochastic vector (an incidental parameter), $\alpha = 1, \ldots, N$,
satisfying $\sum_{\alpha=1}^{N} f_\alpha = 0$. The likelihood function is

(2)    $L = \dfrac{1}{(2\pi)^{pN/2} \prod_{i=1}^{p} \psi_{ii}^{N/2}} \exp\left[ -\dfrac{1}{2} \sum_{i=1}^{p} \sum_{\alpha=1}^{N} \dfrac{(x_{i\alpha} - \mu_i - \sum_{j=1}^{m} \lambda_{ij} f_{j\alpha})^2}{\psi_{ii}} \right].$

This likelihood function does not have a maximum. To show this fact, let
$\mu_1 = 0$, $\lambda_{11} = 1$, $\lambda_{1j} = 0$ $(j \ne 1)$, $f_{1\alpha} = x_{1\alpha}$. Then $x_{1\alpha} - \mu_1 - \sum_{j=1}^{m} \lambda_{1j} f_{j\alpha} = 0$,
and $\psi_{11}$ does not appear in the exponent but appears only in the constant.
As $\psi_{11} \to 0$, $L \to \infty$. Thus the likelihood does not have a maximum, and the
maximum likelihood estimators do not exist [Anderson and Rubin (1956)].
Lawley (1941) set the partial derivatives of the likelihood equal to 0, but
Solari (1969) showed that the solution is only a stationary value, not a
maximum.
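
The unboundedness is easy to see numerically. The sketch below uses simulated data (hypothetical) with p = 3 and m = 1 and, as a slight variation on the argument above, takes $\mu_1 = \bar x_1$ and $f_{1\alpha} = x_{1\alpha} - \bar x_1$ so that the factor scores sum to 0; the first residual is then identically 0 and the log likelihood grows without bound as $\psi_{11}$ shrinks.

```python
import numpy as np

def log_likelihood(X, mu, Lam, F, psi):
    """Log of (2): X is N x p, F is N x m (rows f_alpha'), psi holds the psi_ii."""
    N, p = X.shape
    resid = X - mu - F @ Lam.T
    return (-0.5 * N * p * np.log(2 * np.pi)
            - 0.5 * N * np.sum(np.log(psi))
            - 0.5 * np.sum(resid**2 / psi))

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))             # hypothetical data
mu = X.mean(axis=0)
F = X[:, [0]] - mu[0]                    # f_{1 alpha} = x_{1 alpha} - mu_1, summing to 0
Lam = np.array([[1.0], [0.0], [0.0]])    # lambda_11 = 1, other loadings 0

values = [log_likelihood(X, mu, Lam, F, np.array([psi11, 1.0, 1.0]))
          for psi11 in (1e-2, 1e-4, 1e-8)]
```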

Since maximum likelihood estimators do not exist in the case of fixed 
factors, what estimation methods can be used? One possibility is to use the 
maximum likelihood method appropriate for random factors. It was stated by 
Anderson and Rubin (1956) and proved by Fuller, Pantula, and Amemiya 
(1982) in the case of identification by 0’s that the asymptotic normal distribu- 
tion of the maximum likelihood estimators for the random case is the same as 
for fixed factors. 

The sample covariance matrix under normality has the noncentral Wishart
distribution [Anderson (1946a)] depending on $\Psi$, $\Lambda\Phi\Lambda'$, and N - 1. Anderson
and Rubin (1956) proposed maximizing this likelihood function. However,
one of the equations is difficult to solve. Again the estimators are
asymptotically equivalent to the maximum likelihood estimators for the
random-factor case.


14.5. FACTOR INTERPRETATION AND TRANSFORMATION 


14.5.1. Interpretation 


The identification restrictions of $\Lambda'\Psi^{-1}\Lambda$ diagonal or the first m rows of $\Lambda$
being $I_m$ may be convenient for computing the maximum likelihood estima-
tors, but the components of the factor score vector may not have any intrinsic 
meaning. We saw in Section 14.2 that 0 coefficients may give meaning to a 
factor by the fact that this factor does not affect certain tests. Similarly, large 
factor loadings may help in interpreting a factor. The coefficient of verbal 
ability, for example, should be large on tests that look like they are verbal. 

In psychology each variable or factor usually has a natural positive direc- 
tion: more answers right on a test and more of the ability represented by the 
factor. It is usually expected that more ability leads to higher performance; 
that is, the factor loading should be positive if it is not 0. Therefore, roughly
speaking, for the sake of interpretation, one may look for factor loadings that
are either 0 or positive and large.

[Figure 14.1. Rows of $\hat\Lambda$: the five points $(\hat\lambda_{i1}, \hat\lambda_{i2})$, $i = 1, \ldots, 5$, plotted in the plane of the two factors.]


14.5.2. Transformations 


The maximum likelihood estimators on the basis of some arbitrary identification
conditions including $\Phi = I$ are $\hat\Lambda$ and $\hat\Psi$. We consider transformations

(1)    $\hat\Lambda^* = \hat\Lambda P, \qquad \hat\Phi^* = P^{-1}(P')^{-1} = (P'P)^{-1}.$

If the factors are to be orthogonal, then $\hat\Phi^* = I$ and P is orthogonal. If the
factors are permitted to be oblique, P can be an arbitrary nonsingular matrix
and $\hat\Phi^*$ an arbitrary positive definite matrix.

The rows of $\hat\Lambda$ can be plotted in an m-dimensional space. Figure 14.1 is a
plot of the rows of a 5 x 2 matrix $\hat\Lambda$. The coordinates refer to factors and the
points refer to tests. If $\hat\Phi^*$ is required to be I, we are seeking a rotation of
coordinate axes in this space. In the example that is graphed, a rotation of
45° would put all of the points into the positive quadrant, that is, $\hat\lambda^*_{ij} \ge 0$. One
of the new coordinates would be large for each of the first three points and
small for the other two points, and the other coordinate would be small for
the first three and large for the last two. The first factor is representative of
what is common to the first three tests, and the second factor of what is
common to the last two tests.

If m > 2, a general rotation can be approximated manually by a sequence
of two-dimensional rotations.

If $\hat\Phi^*$ is not required to be $I_m$, the transformation P is simply nonsingular.
If the normalization of the jth column of $\Lambda$ is $\lambda_{i(j),j} = 1$, then

(2)    $1 = \hat\lambda^*_{i(j),j} = \sum_{k=1}^{m} \hat\lambda_{i(j),k} \, p_{kj};$

each column of P satisfies such a constraint. If the normalization is $\hat\phi^*_{jj} = 1$,
then

(3)    $1 = \sum_{k=1}^{m} (p^{jk})^2,$

where $(p^{jk}) = P^{-1}$.

Of the various computational procedures that are based on optimizing an
objective function, we describe the varimax method proposed by Kaiser
(1958) to be carried out on pairs of factors. Horst (1965), Chapter 18,
extended the method to be done on all factors simultaneously. A modified
criterion is

(4)    $\sum_{j=1}^{m} \sum_{i=1}^{p} \left[ (\lambda^*_{ij})^2 - \dfrac{1}{p} \sum_{h=1}^{p} (\lambda^*_{hj})^2 \right]^2,$

which is proportional to the sum of the column variances of the squares of
the transformed factor loadings. The orthogonal matrix P is selected so as to
maximize (4). The procedure tends to maximize the scatter of $(\lambda^*_{ij})^2$ within
columns. Since $(\lambda^*_{ij})^2 \ge 0$, there is a tendency to obtain some large loadings and
some near 0. Kaiser's original criterion was (4) with $(\lambda^*_{ij})^2$ replaced by
$(\lambda^*_{ij})^2 / \sum_{h=1}^{m} (\lambda^*_{ih})^2$.

Lawley and Maxwell (1971) describe other criteria. One of them is a
measure of similarity to a predetermined p x m matrix of 1's and 0's.
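
For two factors the varimax idea can be illustrated with a direct search over the rotation angle. The sketch below is a simplification: it evaluates criterion (4) on a grid instead of using Kaiser's closed-form angle, and the loadings are hypothetical values chosen to resemble a simple structure that has been rotated away by 45°.

```python
import numpy as np

def varimax_criterion(Lam):
    """Criterion (4): squared deviations of squared loadings from their column means."""
    sq = Lam**2
    return np.sum((sq - sq.mean(axis=0))**2)

def rotation(theta):
    """2 x 2 orthogonal rotation matrix for the angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def best_planar_rotation(Lam, n_grid=3600):
    """Grid search over the angle for m = 2; the criterion has period pi/2."""
    thetas = np.linspace(0.0, np.pi / 2, n_grid, endpoint=False)
    values = [varimax_criterion(Lam @ rotation(t)) for t in thetas]
    k = int(np.argmax(values))
    return rotation(thetas[k]), values[k]

# Hypothetical 5 x 2 loadings: a simple structure hidden by a 45 degree rotation.
simple = np.array([[0.8, 0.0], [0.7, 0.1], [0.9, 0.0], [0.0, 0.8], [0.1, 0.7]])
Lam_hat = simple @ rotation(np.pi / 4).T
P, value = best_planar_rotation(Lam_hat)
Lam_star = Lam_hat @ P
```

Because P is orthogonal, the product of the loading matrix with its transpose, and hence the fitted covariance matrix, is unchanged by the rotation; only the interpretation of the columns changes.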


14.5.3. Orthogonal versus Oblique Factors 


In the case of orthogonal factors the components are uncorrelated in the 
population or in the sample according to whether the factors are considered 
random or fixed. The idea of uncorrelated factor scores has appeal. Some 
psychologists claim that the orthogonality of the factor scores is essential if 
one is to consider the factor scores more basic than the test scores. Consider- 
able debate has gone on among psychologists concerning this point. On the 
other side, Thurstone (1947), page vii, says "It seems just as unnecessary to
require that mental traits shall be uncorrelated in the general population as 
to require that height and weight be uncorrelated in the general population." 

As we have seen, given a pair of matrices $\Lambda$, $\Phi$, equivalent pairs are given
by $\Lambda P$, $P^{-1}\Phi P'^{-1}$ for nonsingular P's. The pair may be selected (i.e., the P
given $\Lambda$, $\Phi$) as the one with the most meaningful interpretation in terms of
the subject matter of the tests. The idea of simple structure is that with 0
factor loadings in certain patterns the component factor scores can be given
meaning regardless of the moment matrix. Permitting $\Phi$ to be an arbitrary
positive definite matrix allows more 0's in $\Lambda$.

Another consideration in selecting transformations or identification conditions
is autonomy, or permanence, or invariance with regard to certain
changes. For example, what happens if a selection of the constituents of a 
population is made? In case of intelligence tests, suppose a selection is made, 
such as college admittees out of high school seniors, that can be assumed to 
involve the primary abilities. One can envisage that the relation between 
unobserved factor scores f and observed test scores x is unaffected by the 
selection, that is, that the matrix of factor loadings A is unchanged. The 
variance of the errors (and specific factors), the diagonal elements of W, may 
also be considered as unchanged by the selection because the errors are 
uncorrelated with the factors (primary abilities). 

Suppose there is a true model, $\Lambda$, $\Phi$, $\Psi$, and the investigator applies
identification conditions that permit him to discover it. Next, suppose there is
a selection that results in a new population of factor scores so that their
covariance matrix is $\Phi^*$. When the investigator analyzes the new observed
covariance matrix $\Psi + \Lambda\Phi^*\Lambda'$, will he find $\Lambda$ again? If part of the identification
conditions are that the factor moment matrix is I, then he will obtain
a different factor loading matrix. On the other hand, if the identification
conditions are entirely on the factor loadings (specified 0's and 1's), the factor
loading matrix from the analysis is the same as before.

The same consideration is relevant in comparing two populations. It may
be reasonable to consider that $\Psi_1 = \Psi_2$, $\Lambda_1 = \Lambda_2$, but $\Phi_1 \ne \Phi_2$. To test the
hypothesis that $\Phi_1 = \Phi_2$, one wants to use identification conditions that
agree with $\Lambda_1 = \Lambda_2$ (rather than $\Lambda_1 = \Lambda_2 G$). The condition should be on
the factor loadings.

What happens if more tests are added (or deleted)? In addition to
observing $X = \Lambda f + \mu + U$, suppose one observes $X^* = \Lambda^* f + \mu^* + U^*$,
where U* is uncorrelated with U. Since the common factors f are unchanged,
$\Phi$ is unchanged. However, the (arbitrary) condition that $\Lambda'\Psi^{-1}\Lambda$
is diagonal is changed; use of this type of condition would lead to a rotation
of $(\Lambda' \; \Lambda^{*\prime})'$.


14.6. ESTIMATION FOR IDENTIFICATION BY SPECIFIED ZEROS 


We now consider estimation of $\Lambda$, $\Psi$, and $\Phi$ when $\Phi$ is unrestricted and $\Lambda$
is identified by specified 0's and 1's. We assume that each column of $\Lambda$ has at
least m - 1 0's in specified positions and that the submatrix consisting of the
rows of $\Lambda$ containing the 0's specified for a given column is of rank m - 1.
(See Section 14.2.2.) We further assume that each column of $\Lambda$ has 1 in a
specified position or, alternatively, that the diagonal element of $\Phi$ corresponding
to that column is 1. Then the model is identified.

The likelihood function is given by (1) of Section 14.3. The derivatives of 
the likelihood function set equal to 0 are 


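(1)    $\operatorname{diag} \hat\Sigma^{-1}[C - (\hat\Psi + \hat\Lambda\hat\Phi\hat\Lambda')]\hat\Sigma^{-1} = \operatorname{diag} 0,$

(2)    $\hat\Lambda'\hat\Sigma^{-1}[C - (\hat\Psi + \hat\Lambda\hat\Phi\hat\Lambda')]\hat\Sigma^{-1}\hat\Lambda = 0$

for positions in $\Phi$ that are not specified, and

(3)    $\hat\Sigma^{-1}[C - (\hat\Psi + \hat\Lambda\hat\Phi\hat\Lambda')]\hat\Sigma^{-1}\hat\Lambda\hat\Phi = 0$

for positions in $\Lambda$ not specified, where

(4)    $\hat\Sigma = \hat\Psi + \hat\Lambda\hat\Phi\hat\Lambda'.$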


These equations cannot be simplified as in Section 14.3.1 because (3) holds 
only for unspecified positions in A, and hence one cannot multiply by X on 
the left. [See Howe (1955), Anderson and Rubin (1956), and Lawley (1958).] 

These equations are not useful for computation. The likelihood function,
however, can be maximized numerically.

As noted before, a change in units of measurement, $X^* = DX$, results in a
corresponding change in the parameters $\Lambda$ and $\Psi$ if identification is by 0 in
specified positions of $\Lambda$ and normalization is by $\phi_{jj} = 1$, $j = 1, \ldots, m$. It is
readily verified that the derivative equations (1), (2), (3), and (4) are changed
in a corresponding manner.

Anderson and Amemiya (1988a) have derived the asymptotic distribution 
of the estimators under general conditions. Normality of the observations is 
not required. See also Anderson and Amemiya (1988b). 


14.7. ESTIMATION OF FACTOR SCORES 


It is frequently of interest to estimate the factor scores of the individuals in
the group being studied. In the model with nonstochastic factors the factor
scores are incidental parameters that characterize the individuals. As we
have seen (Section 14.4), the maximum likelihood estimators of the parameters
$(\Psi, \Lambda, \mu, f_1, \ldots, f_N)$ do not exist. We shall therefore study the estimation
of the factor scores on the basis that the structural parameters $(\Psi, \Lambda, \mu)$
are known.

When $f_\alpha$ is considered as an incidental parameter, $x_\alpha - \mu$ is an observation
from a distribution with mean $\Lambda f_\alpha$ and covariance matrix $\Psi$. The
weighted least squares estimator of $f_\alpha$ is

(1)    $\hat f_\alpha = (\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu) = \Gamma^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu),$

where $\Gamma = \Lambda'\Psi^{-1}\Lambda$ (not necessarily diagonal). This estimator is unbiased
and its covariance matrix is

(2)    $\mathcal{E}(\hat f_\alpha - f_\alpha)(\hat f_\alpha - f_\alpha)' = (\Lambda'\Psi^{-1}\Lambda)^{-1} = \Gamma^{-1}$

by the usual generalized least squares theory [Bartlett (1937b), (1938)]. It is
the minimum variance unbiased linear estimator of $f_\alpha$. If $x_\alpha$ is normal, the
estimator is also maximum likelihood.

When $f_\alpha$ is considered random [Thomson (1951)], we suppose $X_\alpha$ and $f_\alpha$
have a joint normal distribution with mean vector $(\mu', 0')'$ and covariance
matrix

(3)    $\begin{pmatrix} \Psi + \Lambda\Phi\Lambda' & \Lambda\Phi \\ \Phi\Lambda' & \Phi \end{pmatrix}.$

Then the regression of f on X (Section 2.5) is

(4)    $\mathcal{E}(f \mid X) = \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}(x - \mu) = \Phi(\Phi + \Phi\Gamma\Phi)^{-1}\Phi\Lambda'\Psi^{-1}(x - \mu).$

The estimator or predictor of $f_\alpha$ is

(5)    $f_\alpha^{+} = \Phi(\Phi + \Phi\Gamma\Phi)^{-1}\Phi\Lambda'\Psi^{-1}(x_\alpha - \mu).$

If $\Phi = I$, the predictor is

(6)    $f_\alpha^{+} = (I + \Gamma)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu).$

When $\Gamma$ is also diagonal, the jth element of (6) is $\gamma_j/(1 + \gamma_j)$ times the jth
element of (1). In the conditional distribution of $x_\alpha$ given $f_\alpha$ (for $\Phi = I$)

(7)    $\mathcal{E}(f_\alpha^{+} \mid f_\alpha) = (I + \Gamma)^{-1}\Gamma f_\alpha,$

(8)    $\mathcal{C}(f_\alpha^{+} \mid f_\alpha) = (I + \Gamma)^{-1}\Gamma(I + \Gamma)^{-1},$

(9)    $\mathcal{E}[(f_\alpha^{+} - f_\alpha)(f_\alpha^{+} - f_\alpha)' \mid f_\alpha] = (I + \Gamma)^{-1}(\Gamma + f_\alpha f_\alpha')(I + \Gamma)^{-1},$

(10)    $\mathcal{E}(f_\alpha^{+} - f_\alpha)(f_\alpha^{+} - f_\alpha)' = (I + \Gamma)^{-1}.$

This last matrix, describing the mean squared error, is smaller than (2)
describing the unbiased estimator. The estimator (5) or (6) is a Bayes
estimator and is appropriate when $f_\alpha$ is treated as random.
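
The two score estimators differ only in the matrix that is inverted, which the following sketch makes explicit. It assumes known structural parameters and $\Phi = I$; the function names and the numerical values are hypothetical, chosen so that $\Gamma = \Lambda'\Psi^{-1}\Lambda$ is diagonal with $\gamma_1 = 4$ and $\gamma_2 = 1$.

```python
import numpy as np

def weighted_ls_scores(X, mu, Lam, psi):
    """Weighted least squares estimator (1); rows of X are the observations x_alpha'."""
    W = Lam / psi[:, None]                 # Psi^{-1} Lam
    Gamma = Lam.T @ W                      # Lam' Psi^{-1} Lam
    return np.linalg.solve(Gamma, W.T @ (X - mu).T).T

def regression_scores(X, mu, Lam, psi):
    """Predictor (6), the regression of f on x when Phi = I."""
    W = Lam / psi[:, None]
    Gamma = Lam.T @ W
    m = Lam.shape[1]
    return np.linalg.solve(np.eye(m) + Gamma, W.T @ (X - mu).T).T

Lam = np.array([[1.0, 0.5], [1.0, -0.5], [1.0, 0.5], [1.0, -0.5]])
psi = np.ones(4)
mu = np.zeros(4)
X = np.array([[1.0, 0.2, 0.7, -0.1],
              [0.3, 0.9, -0.4, 0.5]])      # two hypothetical observations

f_wls = weighted_ls_scores(X, mu, Lam, psi)
f_reg = regression_scores(X, mu, Lam, psi)
shrink = np.array([4.0 / 5.0, 1.0 / 2.0])  # gamma_j / (1 + gamma_j)
```

The regression scores are the weighted least squares scores shrunk toward 0, which is the source of the smaller mean squared error noted above.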


PROBLEMS


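14.1. (Sec. 14.2) Identification by 0's. Let

    $\Lambda = \begin{pmatrix} 0 & \Lambda^{(1)} \\ \lambda^{(2)} & \Lambda^{(2)} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & c_{12} \\ c_{21} & C_{22} \end{pmatrix}.$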

implies 


if and only if AU is of rank m — 1. 
14.2. (Sec. 14.3) For p —5, m = 1. and А = А. prove |051 = ПА КАИ. 
14.3. (Sec. 14.3) The EM algorithm. 


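(a) If f and U are normal and f and X are observed, show that the likelihood
function based on $(x_1, f_1), \ldots, (x_N, f_N)$ is

    $\prod_{\alpha=1}^{N} \left\{ \dfrac{1}{(2\pi)^{p/2} \prod_{i=1}^{p} \psi_{ii}^{1/2}} \exp\left[ -\dfrac{1}{2} \sum_{i=1}^{p} \dfrac{(x_{i\alpha} - \mu_i - \sum_{j=1}^{m} \lambda_{ij} f_{j\alpha})^2}{\psi_{ii}} \right] \cdot \dfrac{1}{(2\pi)^{m/2} |\Phi|^{1/2}} \exp\left[ -\dfrac{1}{2} f_\alpha' \Phi^{-1} f_\alpha \right] \right\}.$

(b) Show that when the factor scores are included as data the sufficient set of
statistics is $\bar x$, $\bar f$, $C_{xx} = C$,

    $C_{xf} = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(f_\alpha - \bar f)', \qquad C_{ff} = \dfrac{1}{N} \sum_{\alpha=1}^{N} (f_\alpha - \bar f)(f_\alpha - \bar f)'.$

(c) Show that the conditional expectations of the covariances in (b) given
$X = (x_1, \ldots, x_N)$, $\Lambda$, $\Phi$, and $\Psi$ are

    $C^*_{xx} = \mathcal{E}(C_{xx} \mid X, \Lambda, \Phi, \Psi) = C,$

    $C^*_{xf} = \mathcal{E}(C_{xf} \mid X, \Lambda, \Phi, \Psi) = C(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi,$

    $C^*_{ff} = \mathcal{E}(C_{ff} \mid X, \Lambda, \Phi, \Psi) = \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1} C (\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi + \Phi - \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi.$

(d) Show that the maximum likelihood estimators of $\Lambda$ and $\Psi$ given $\Phi = I$ are

    $\hat\Lambda = C^*_{xf}(C^*_{ff})^{-1}, \qquad \hat\Psi = \operatorname{diag}\left[ C^*_{xx} - C^*_{xf}(C^*_{ff})^{-1}C^*_{fx} \right].$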

CHAPTER 15 


Patterns of Dependence; 
Graphical Models 


15.1. INTRODUCTION 


An emphasis in multivariate statistical analysis is that several measurements 
on a number of individuals or objects may be correlated, and the methods
developed in this book take account of that dependence. The amount of 
association between two variables may be measured by the (Pearson) correla- 
tion of them (a symmetric measure); the association between one variable 
and a set may be quantified by a multiple correlation; and the dependence 
between one set and another set may be studied by criteria of independence 
such as studied in Chapter 9 or by canonical correlations. Similar measures 
can be applied in conditional distributions. Another kind of dependence 
(asymmetrical) is characterized by regression coefficients and related mea- 
sures. In this chapter we study models which involve several kinds of 
dependence or more intricate patterns of dependence.

A graphical model in statistics is a visual diagram in which observable
variables are identified with points (vertices or nodes) connected by edges and 
an associated family of probability distributions satisfying some indepen- 
dences specified by the visual pattern. Edges may be undirected (drawn as 
line segments) or directed (drawn as arrows). Undirected edges have to do 
with symmetrical dependence and independence, while directed edges may 
reflect a possible direction of action or sequence in time. These indepen- 
dences may come from a priori knowledge of the subject matter or may 
derive from these or other data. Advantages of the graphical display include
ease of comprehension, particularly of complicated patterns, ease of elicitation
of expert opinion, and ease of comparing probabilities.

Use of such diagrams goes back at least to the work of the geneticist 
Sewall Wright (1921), (1934), who used the term “path analysis.” An elabo- 
rate algebra has been developed for graphical models. Specification of 
independences reduces the number of parameters to be determined. Some of 
these independences are known as Markov properties. In a time series analysis 
of a Markov process (of order 1), for example, the future of the process is
considered independent of the past when the present is given; in such a 
model the correlation between a variable in the past and a variable in the 
future is determined by the correlation between the present variable and the 
variable of the immediate future. This idea is expanded in several ways. 

The family of probability distributions associated with a given diagram 
depends on the properties of the distribution that are represented by the 
graph. These properties for diagrams consisting of undirected edges (known 
as undirected graphs) will be described in Section 15.2; the properties for 
diagrams consisting entirely of directed edges (known as directed graphs) in 
Section 15.3; and properties of diagrams with both types of edges in Section 
15.4. The methods of statistical inference will be given in Section 15.5. 

In this chapter we assume that the variables have a joint nonsingular 
normal distribution; hence, the characterization of a model is in terms of the 
covariance matrix and its inverse, and functions of them. This assumption 
implies that the variables are quantitative and have a positive density. The 
mathematics of graphical models may apply to discrete variables (contingency 
tables) and to nonnormal quantitative variables, but we shall not develop the 
theory necessary to include them. 

There is a considerable social science literature that has followed Wright's 
original work. For recent reviews of this writing see, for example, Pearl 
(2000) and McDonald (2002). 


15.2. UNDIRECTED GRAPHS


A graph is a set of vertices and edges, G = (V, E). Each vertex is identified
with a random vector. In this chapter the random variables have a joint 
normal distribution. Each undirected edge is a line connecting two vertices. It 
is designated by its two end points; (u,v) is the same as (v,u) in an 
undirected graph (but not in directed graphs). 

Two vertices connected by an edge are called adjacent; if not connected by
an edge, they are called nonadjacent. In Figure 15.1(a) all vertices are
nonadjacent; in (b) a and b are adjacent; in (c) the pair a and b and the pair
a and c are adjacent; in (d) every pair of vertices are adjacent.

[Figure 15.1. Four undirected graphs, panels (a) to (d), on the vertices a, b, c.]

The family of (normal) distributions associated with G is defined by a set
of requirements on conditional distributions, known as Markov properties.
Since the distributions considered here are normal, the conditions have to do
with the covariance matrix $\Sigma$ and its inverse $\Lambda = \Sigma^{-1}$, which is known as the
concentration matrix. However, many of the lemmas and theorems hold for
nonnormal distributions. We shall consider three definitions of Markov and
then show that they are equivalent.


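Definition 15.2.1. The probability distribution on a graph is pairwise
Markov with respect to G if for every pair of vertices (u, v) that are not adjacent
$X_u$ and $X_v$ are independent conditional on all the other variables in the graph.


In symbols

(1)    $X_u \perp\!\!\!\perp X_v \mid X_{V \setminus \{u, v\}},$

where $\perp\!\!\!\perp$ means independence and $V \setminus \{u, v\}$ indicates the set V with u and v
deleted. The definition of pairwise Markov is that $\rho_{uv \cdot V \setminus \{u, v\}} = 0$ for all pairs
for which $(u, v) \notin E$. We may also write $u \perp\!\!\!\perp v \mid V \setminus \{u, v\}$.

Let $\Sigma$ and $\Lambda = \Sigma^{-1}$ be partitioned as

(2)    $\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix},$

where A and B are disjoint sets of vertices. The conditional distribution of
$X_A$ given $X_B$ is

(3)    $N(\Sigma_{AB}\Sigma_{BB}^{-1} x_B, \; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}).$

The conditional covariance matrix is

(4)    $\Sigma_{AA \cdot B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} = \Lambda_{AA}^{-1}.$

If A = {1, 2} and B = {3, ..., p}, the covariance of $X_1$ and $X_2$ given $X_3, \ldots, X_p$
is $\sigma_{12 \cdot 3, \ldots, p}$ in $\Sigma_{AA \cdot B} = (\sigma_{ij \cdot 3, \ldots, p})$. This is 0 if and only if $\lambda_{12} = 0$; that is, $\Sigma_{AA \cdot B}$
is diagonal if and only if $\Lambda_{AA}$ is diagonal.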
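
Relation (4) and the link between partial correlation and the concentration matrix can be verified numerically. The covariance matrix in this sketch is hypothetical; the index sets use Python's 0-based numbering, so A = {1, 2} becomes `[0, 1]`.

```python
import numpy as np

Sigma = np.array([[2.0, 0.6, 0.5, 0.3],
                  [0.6, 1.5, 0.4, 0.2],
                  [0.5, 0.4, 1.3, 0.1],
                  [0.3, 0.2, 0.1, 1.2]])   # hypothetical positive definite covariance
A, B = [0, 1], [2, 3]

Lam = np.linalg.inv(Sigma)                 # concentration matrix
S_AA_B = (Sigma[np.ix_(A, A)]
          - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])
Lam_AA_inv = np.linalg.inv(Lam[np.ix_(A, A)])

# Partial correlation of X_1 and X_2 given the rest, computed in two ways.
partial_from_cov = S_AA_B[0, 1] / np.sqrt(S_AA_B[0, 0] * S_AA_B[1, 1])
partial_from_conc = -Lam[0, 1] / np.sqrt(Lam[0, 0] * Lam[1, 1])
```

The second expression shows directly why the partial correlation vanishes exactly when the corresponding element of the concentration matrix is 0.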


Theorem 15.2.1. If a distribution on a graph is pairwise Markov, $\lambda_{ij} = 0$ for
$(i, j) \notin E$.


Definition 15.2.2. The boundary of a set A, termed bd(A), consists of
those vertices not in A that are adjacent to A. The closure of A, termed cl(A), is
$A \cup \mathrm{bd}(A)$.


Definition 15.2.3. A distribution on a graph is locally Markov if for every
vertex v the variable $X_v$ is independent of the variables not in cl(v) conditional on
the boundary of v; in notation,

(5)    $X_v \perp\!\!\!\perp X_{V \setminus \mathrm{cl}(v)} \mid X_{\mathrm{bd}(v)}.$


Theorem 15.2.2. The conditional independences

(6)    $X \perp\!\!\!\perp Y \mid Z, \qquad X \perp\!\!\!\perp Z \mid Y$

hold if and only if

(7)    $X \perp\!\!\!\perp (Y, Z).$


Proof. The relations (6) imply that the density of X, Y, and Z can be
written as

(8)    $f(x, y, z) = f(x \mid z)\, g(y \mid z)\, h(z) = k(x \mid y)\, l(z \mid y)\, m(y).$

Since $g(y \mid z) h(z) = n(y, z) = l(z \mid y) m(y)$, (8) implies $f(x \mid z) = k(x \mid y)$, which in
turn implies $f(x \mid z) = k(x \mid y) = p(x)$. Hence

(9)    $f(x, y, z) = p(x)\, n(y, z),$

which is the density generating (7). Conversely, (9) can be written as either
form in (8), implying (6). ∎


Corollary 15.2.1. The relations

(10)    $X \perp\!\!\!\perp Y \mid Z, W, \qquad X \perp\!\!\!\perp Z \mid Y, W$

hold if and only if

(11)    $X \perp\!\!\!\perp (Y, Z) \mid W.$


The relations in Theorem 15.2.2 and Corollary 15.2.1 are sometimes called
the block independence theorem. They are based on positive densities, that is,
nonsingular normal distributions.


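Theorem 15.2.3. A locally Markov distribution on a graph is pairwise
Markov.


Proof. Suppose the graph is locally Markov (Definition 15.2.3). Let u and
v be nonadjacent vertices. Because v is not adjacent to u, it is not in bd(u);
hence,

(12)    $X_u \perp\!\!\!\perp X_{V \setminus \mathrm{cl}(u)} \mid X_{\mathrm{bd}(u)}.$

The relation (12) can be written

(13)    $X_u \perp\!\!\!\perp \{X_v, X_{V \setminus [\mathrm{cl}(u) \cup v]}\} \mid X_{\mathrm{bd}(u)}.$

Then Corollary 15.2.1 ($X = X_u$, $Y = X_v$, $Z = X_{V \setminus [\mathrm{cl}(u) \cup v]}$, $W = X_{\mathrm{bd}(u)}$) implies

(14)    $X_u \perp\!\!\!\perp X_v \mid X_{V \setminus \{u, v\}}.$ ∎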

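Theorem 15.2.4. A pairwise Markov distribution on a graph is locally
Markov.


Proof. Let $V \setminus \mathrm{cl}(u) = v_1 \cup \cdots \cup v_n$. Then

(15)    $u \perp\!\!\!\perp v_1 \mid \mathrm{bd}(u) \cup v_2 \cup \cdots \cup v_n, \qquad u \perp\!\!\!\perp v_2 \mid \mathrm{bd}(u) \cup v_1 \cup v_3 \cup \cdots \cup v_n,$

which by Corollary 15.2.1 implies

(16)    $u \perp\!\!\!\perp v_1 \cup v_2 \mid \mathrm{bd}(u) \cup v_3 \cup \cdots \cup v_n.$

Further, (16) and

(17)    $u \perp\!\!\!\perp v_3 \mid \mathrm{bd}(u) \cup v_1 \cup v_2 \cup v_4 \cup \cdots \cup v_n$

imply

(18)    $u \perp\!\!\!\perp v_1 \cup v_2 \cup v_3 \mid \mathrm{bd}(u) \cup v_4 \cup \cdots \cup v_n.$

This procedure leads to

(19)    $u \perp\!\!\!\perp v_1 \cup \cdots \cup v_n \mid \mathrm{bd}(u).$ ∎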


A third notion of Markov, namely, global, requires some definitions.


Definition 15.2.4. A path from B to C is a sequence $v_0, v_1, v_2, \ldots, v_n$ of
adjacent vertices with $v_0 \in B$ and $v_n \in C$.


Definition 15.2.5. A set S separates sets B and C if S, B, and C are
disjoint and every path from B to C intersects S.


Thus S separates B and C if for every sequence of adjacent vertices $v_0, v_1, \ldots, v_n$
with $v_0 \in B$ and $v_n \in C$ at least one of $v_1, \ldots, v_{n-1}$ is a vertex in S. Here B
and/or C are nonempty, but S can be empty.
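
The separation condition of Definition 15.2.5 can be checked mechanically by a breadth-first search that starts in B and refuses to pass through S. The following Python sketch is only an illustration: the function name `separates` and the adjacency-dictionary representation are assumptions made here, not notation from the text. The example graph is that of Figure 15.1(c), taken to have the edges a-b and a-c.

```python
from collections import deque

def separates(adj, S, B, C):
    """Return True if every path from B to C in the undirected graph `adj`
    passes through a vertex of S (S, B, C are assumed disjoint)."""
    S, B, C = set(S), set(B), set(C)
    seen = set(B)
    queue = deque(B)
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w in S or w in seen:
                continue          # blocked by S, or already explored
            if w in C:
                return False      # reached C without touching S
            seen.add(w)
            queue.append(w)
    return True

# Figure 15.1(c), assumed here to have edges a-b and a-c.
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
print(separates(adj, {"a"}, {"b"}, {"c"}))
print(separates(adj, set(), {"b"}, {"c"}))
```

With S = {a} the search from b is stopped at once, so a separates b and c; with S empty the path b, a, c is found and the sets are not separated.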


Definition 15.2.6. A distribution on a graph is globally Markov if for every
triplet of disjoint sets S, B, and C such that S separates B and C the vector
variables $X_B$ and $X_C$ are independent conditional on $X_S$.


In the example of Figure 15.1(c), a separates b and c. If $\rho_{bc\cdot a} = 0$, that is,
$\rho_{bc} - \rho_{ba}\rho_{ac} = 0$, the distribution is globally Markov. Note that a set of
vertices is identified with a vector of variables.

The global Markov property puts restrictions on the possible (normal)
distributions, and that implies fewer parameters about which to make
inferences.


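Suppose $V = A \cup B \cup S$, where A, B, and S are disjoint. Partition X and
$\Lambda = \Sigma^{-1}$, the concentration matrix, as


(20) $\Lambda = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} & \Lambda_{AS} \\ \Lambda_{BA} & \Lambda_{BB} & \Lambda_{BS} \\ \Lambda_{SA} & \Lambda_{SB} & \Lambda_{SS} \end{pmatrix} = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} & \Sigma_{AS} \\ \Sigma_{BA} & \Sigma_{BB} & \Sigma_{BS} \\ \Sigma_{SA} & \Sigma_{SB} & \Sigma_{SS} \end{pmatrix}^{-1}$.


The conditional distribution of $(X_A', X_B')'$ given $X_S$ is normal with covariance
matrix


(21) $\Sigma_{(AB)\cdot S} = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} - \begin{pmatrix} \Sigma_{AS} \\ \Sigma_{BS} \end{pmatrix} \Sigma_{SS}^{-1} (\Sigma_{SA}, \Sigma_{SB}) = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix}^{-1}$.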


Theorem 15.2.5. If S separates A and B in a graph with a globally Markov
distribution, $\Lambda_{AB} = 0$.


Proof. Because S separates A and B, every element u of A and every
element v of B are nonadjacent, for otherwise the path (u, v) would connect
A and B without intersecting S. The globally Markov property is that $X_A$
and $X_B$ are uncorrelated in the conditional distribution, implying that $\Sigma_{(AB)\cdot S}$
is block diagonal and hence that $\Lambda_{AB} = 0$. ∎


Theorem 15.2.6. A distribution on a globally Markov graph is pairwise
Markov.


Proof. Let the set B be i, the set C be j not adjacent to i, and the set A
the rest of the variables. Any path from B to C must include elements of A.
Hence i is independent of j in the distribution conditioned on the other
variables. ∎


Theorem 15.2.7. A globally Markov family of distributions on a graph is
locally Markov.


Proof. The boundary of a set B separates B and $V \setminus \mathrm{cl}(B)$. ∎


Theorem 15.2.8. A pairwise Markov family of distributions on a graph is
globally Markov.


Proof. Let A, B, and S be disjoint sets in a pairwise Markov graph such
that S separates A and B. Let #(S) and #(V) denote the numbers of
vertices in S and V, respectively. If #(V) = #(S) + 2, that is, $V = A \cup B \cup S$,
then there must be one vertex in each of A and B, and the pairwise Markov
property is exactly the globally Markov property. The rest of the proof is a
backward induction on #(S). Suppose #(V) - #(S) > 2 and $V = A \cup B \cup S$.
Then either A or B or both have more than one vertex. Suppose A has more
than one vertex, and let $u \in A$. Then $S \cup u$ separates $A \setminus u$ and B, and
$S \cup (A \setminus u)$ separates u and B. By the induction hypothesis


(22) $X_{A \setminus u} \perp\!\!\!\perp X_B \mid (X_S, X_u), \qquad X_u \perp\!\!\!\perp X_B \mid (X_S, X_{A \setminus u})$.
By Corollary 15.2.1
(23) $X_A \perp\!\!\!\perp X_B \mid X_S$.


Now suppose $A \cup B \cup S \subset V$. Let $u \in V \setminus (A \cup B \cup S)$. Then $S \cup u$ separates
A and B. By the induction hypothesis


(24) $X_A \perp\!\!\!\perp X_B \mid (X_S, X_u)$.


Also, either $A \cup S$ separates u and B or $B \cup S$ separates A and u.
(Otherwise there would be a path from B to u and from u to A that would
not intersect S.) If $A \cup S$ separates u and B,

(25) $X_u \perp\!\!\!\perp X_B \mid (X_S, X_A)$.
Then Corollary 15.2.1 applied to (24) and (25) implies
(26) $(X_A, X_u) \perp\!\!\!\perp X_B \mid X_S$,


from which we derive $X_A \perp\!\!\!\perp X_B \mid X_S$. ∎


Theorems 15.2.3, 15.2.7, and 15.2.8 show that the three Markov properties
are equivalent: any one implies the other two. The proofs here hold fairly
generally, but in this chapter a nonsingular multivariate normal distribution is
assumed; thus all densities are positive.


Definition 15.2.7. A graph G = (V, E) is complete if and only if every two 
vertices in V are adjacent. 


The definition implies that the graph specifies no restriction on the 
covariance matrix of the multivariate normal distribution. 

A subset $A \subset V$ induces a subgraph $G_A = (A, E_A)$, where the edge set $E_A$
includes all edges (u, v) of G with $(u, v) \in E$, where $u \in A$ and $v \in A$.
A subset A of a graph is complete if and only if every two vertices in A are
adjacent in $E_A$.


Definition 15.2.8. A clique is a maximal complete set of vertices.


"Maximal" means that if another vertex from V is added to the set, the set
will no longer be complete. A clique can be constructed by starting with one
vertex, say $v_1$. If it is not adjacent to any other vertex, $v_1$ alone constitutes a
clique. If $v_1$ is adjacent to $v_2$ [$(v_1, v_2) \in E$], continue constructing a clique
with $v_1$ and $v_2$ in it until a maximal complete subset is obtained. Thus every
vertex is a member of at least one clique, and every edge is included in at
least one clique.
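
The construction just described, starting from one vertex and adding vertices while the set stays complete, can be written as a short greedy routine. The Python sketch below is illustrative only; `grow_clique` and the adjacency dictionary are names assumed here. One pass over the vertices suffices, because a vertex rejected for failing to be adjacent to some member of the set remains nonadjacent to that member as the set grows.

```python
def grow_clique(adj, start):
    """Greedily extend {start} to a maximal complete set of vertices.

    `adj` maps every vertex to the set of its neighbours.
    """
    clique = {start}
    for v in sorted(adj):
        # v may join only if it is adjacent to every current member.
        if v not in clique and all(v in adj[u] for u in clique):
            clique.add(v)
    return clique

# Figure 15.1(c), assumed here to have edges a-b and a-c.
adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
print(sorted(grow_clique(adj, "b")))
print(sorted(grow_clique(adj, "c")))
```

Starting from b the routine returns the clique (a, b), and starting from c it returns (a, c). Different starting vertices or orderings may yield different cliques containing the starting vertex.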


Lemma 15.2.1. If the distribution of $X_V$ is Markov, it is determined by the
set of marginal distributions of all cliques.


In Figure 15.1(a) each of a, b, c is a clique; in (b) each of (a, b) and c is a
clique; in (c) each of (a, b) and (a, c) is a clique; in (d) (a, b, c) is a clique.


Definition 15.2.9. The density $f(x_V)$ factorizes with respect to G if there
are nonnegative functions $g_C(x_C)$ depending on the complete subgraphs such that


(27) $f(x_V) = \prod_{C\ \mathrm{complete}} g_C(x_C)$.


Since it suffices to consider only cliques, an alternative factorization is


(28) $f(x_V) = \prod_{C^*\ \mathrm{cliques}} g_{C^*}(x_{C^*})$.


These functions $g_C(x_C)$ and $g_{C^*}(x_{C^*})$ are not necessarily densities or
conditional densities. The problems of statistical inference may be reduced to
the problems of the complete subgraphs or cliques.


Definition 15.2.10. A decomposition of a graph is formed by three disjoint
sets A, B, S if $V = A \cup B \cup S$, S separates A and B, and S is complete.


In this definition one or more of the sets A, B, and S may be empty. If
both A and B are nonempty, the decomposition is termed proper.


Definition 15.2.11. A graph is decomposable if it is complete or if there is
a proper decomposition (A, B, S) into decomposable subgraphs $G_{A \cup S}$ and
$G_{B \cup S}$.


Theorem 15.2.9. Suppose A, B, S decomposes G = (V, E). Then the density
of $X_V$ factorizes with respect to G if and only if its marginal densities $f_{A \cup S}(x_{A \cup S})$
and $f_{B \cup S}(x_{B \cup S})$ factorize and the densities satisfy


(29) $f_V(x_V)\, f_S(x_S) = f_{A \cup S}(x_{A \cup S})\, f_{B \cup S}(x_{B \cup S})$.


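Proof. Suppose that $f_V(x_V)$ factorizes as
(30) $f_V(x_V) = \prod_{C \in \mathcal{C}} g_C(x_C)$.


Because A, B, S decomposes G, every clique is either a subset of $A \cup S$ or a
subset of $B \cup S$. Let $\mathcal{A}$ denote the cliques that are subsets of $A \cup S$, and $\mathcal{B}$
those that are subsets of $B \cup S$. Then $f_V(x_V) = h(x_{A \cup S})\, k(x_{B \cup S})$, where


(31) $h(x_{A \cup S}) = \prod_{C \in \mathcal{A}} g_C(x_C)$,
(32) $k(x_{B \cup S}) = \prod_{C \in \mathcal{B} \setminus \mathcal{A}} g_C(x_C)$.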

Integration of (30) with respect to $x_B$ gives


(33) $f_{A \cup S}(x_{A \cup S}) = h(x_{A \cup S})\, k(x_S)$,
where
(34) $k(x_S) = \int k(x_{B \cup S})\, dx_B$. ∎


In turn $f_{A \cup S}(x_{A \cup S})$ and $f_{B \cup S}(x_{B \cup S})$ can be factorized, leading to (28).


[Figure 15.2. Directed graph on the vertices 1 (difficulty), 2 (IQ), 3 (grade), 4 (recommendation), and 5 (SAT).]


15.3. DIRECTED GRAPHS 


We now include relations with a direction; the measurement represented by
one vertex u may precede the measurement represented by another vertex v.
In the graph this directed edge is displayed as an arrow pointing from u to v;
in notation it appears as (u, v), which is now distinguished from (v, u). The
precedence may indicate the times of measurement, for example, the
precipitation on two successive days, or may indicate possible causation.

The difficulty of an examination $x_1$ may affect the grade of a student $x_3$;
the grade is also affected by his/her IQ $x_2$. In turn the grade of the student
influences the quality of a letter of recommendation $x_4$; the IQ is a factor in
performance on the SAT, $x_5$. See Figure 15.2. (We shall draw figures so that
the action proceeds from left to right.)

A graph composed entirely of directed edges is called a directed graph. A
cycle, such as 1 → 2, 2 → 3, 3 → 1, is hard to interpret and hence is usually
ruled out. A directed graph without a cycle is an acyclic directed graph (ADG
or DAG), also known as an acyclic digraph. All directed graphs in this
chapter are acyclic.

An acyclic directed graph may represent a recursive linear system. For
example, Figure 15.2 could represent


(1) $X_1 = u_1$,
(2) $X_2 = u_2$,
(3) $X_3 = \beta_{31} X_1 + \beta_{32} X_2 + u_3$,
(4) $X_4 = \beta_{43} X_3 + u_4$,
(5) $X_5 = \beta_{52} X_2 + u_5$,


where $u_1, u_2, u_3, u_4, u_5$ are mutually independent unobserved variables.
Wold (1960) called such models causal chains. Note that the matrix of
coefficients is lower triangular. In general $X_i$ may depend on $X_1, \ldots, X_{i-1}$.
The recursive linear system (1) to (5) generates the recursive factorization


(6) $f_{12345}(x_1, x_2, x_3, x_4, x_5) = f_1(x_1)\, f_2(x_2)\, f_{3|12}(x_3 \mid x_1, x_2)\, f_{4|3}(x_4 \mid x_3)\, f_{5|2}(x_5 \mid x_2)$.
A directed graph induces a partial order.


Definition 15.3.1. A partial ordering of an acyclic directed graph $u \le v$ is
defined by the existence of a directed path
(7) $u = v_0 \to v_1 \to \cdots \to v_n = v$.


The partial ordering satisfies the conditions (i) reflexive: $v \le v$; (ii) transitive:
$u \le v$ and $v \le w$ imply $u \le w$; and (iii) antisymmetric: $u \le v$ and $v \le u$ imply
$u = v$. Further, $u \le v$ and $u \ne v$ defines $u < v$.


Definition 15.3.2. If $u \to v$, then u is a parent of v, termed u = pa(v), and
v is a child of u, termed v = ch(u). In symbols


(8) $\mathrm{pa}(v) = \{w \in V \setminus v \mid w \to v\}$,
(9) $\mathrm{ch}(u) = \{w \in V \setminus u \mid u \to w\}$.


In the graph displayed in Figure 15.2 we have (1, 2) = pa(3), 3 = pa(4),
2 = pa(5), 3 = ch(1, 2), 4 = ch(3), and 5 = ch(2).


Definition 15.3.3. If $u < v$, then v is a descendant of u,
(10) $\mathrm{de}(u) = \{v \mid u < v\}$,
and u is an ancestor of v,
(11) $\mathrm{an}(v) = \{u \mid u < v\}$.


The set of nondescendants of u is $\mathrm{Nd}(u) = V \setminus \mathrm{de}(u)$, and the set of strict
nondescendants is $\mathrm{nd}(u) = \mathrm{Nd}(u) \setminus u$. Define $\mathrm{An}(A) = \mathrm{an}(A) \cup A$.

[Figure 15.3. A vertex v with pa(v), a vertex w, and de(v).]

Note that
(12) $\mathrm{pa}(v) \subset \mathrm{an}(v) \subset \mathrm{nd}(v)$.
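
The sets of Definitions 15.3.2 and 15.3.3 are easily computed from a list of directed edges. The following Python sketch does so for the graph of Figure 15.2, whose edges 1 → 3, 2 → 3, 3 → 4, 2 → 5 are read off from the recursive system (1) to (5). The function names mirror the notation of the text; the edge-list representation is an assumption made for illustration.

```python
V = {1, 2, 3, 4, 5}
edges = [(1, 3), (2, 3), (3, 4), (2, 5)]  # (u, v) stands for u -> v

def pa(v):
    """Parents of v."""
    return {u for (u, w) in edges if w == v}

def ch(u):
    """Children of u."""
    return {w for (x, w) in edges if x == u}

def de(u):
    """Descendants of u: vertices reached by a directed path of length >= 1."""
    found, stack = set(), [u]
    while stack:
        for w in ch(stack.pop()):
            if w not in found:
                found.add(w)
                stack.append(w)
    return found

def an(v):
    """Ancestors of v."""
    return {u for u in V if v in de(u)}

def nd(u):
    """Strict nondescendants of u."""
    return V - de(u) - {u}

for v in sorted(V):
    print(v, sorted(pa(v)), sorted(an(v)), sorted(nd(v)))
```

For every vertex the printed sets satisfy the inclusions (12); for example, pa(4) = {3}, an(4) = {1, 2, 3}, and nd(4) = {1, 2, 3, 5}.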


In our study of undirected graphs we considered three Markov properties 
independently defined and then showed that a graph with one Markov 
property also has the other two. In the case of acyclic directed graphs we 
shall define three similar Markov properties, but the definitions are different 
because they take account of the direction of action or influence. 


Definition 15.3.4. A distribution on an acyclic directed graph G is pairwise
Markov if for every $v \in V$ and $w \in \mathrm{nd}(v) \setminus \mathrm{pa}(v)$


(13) $v \perp\!\!\!\perp w \mid \mathrm{nd}(v) \setminus w$.


In comparison with Definition 15.2.1 for undirected graphs, note that
attention is paid only to vertices in nd(v); since pa(v) is the effective
boundary of v, the vertices w and v are nonadjacent. (See Figure 15.3.) Note
also that the conditioning set includes the parents of v, but not the children
(which are descendants).


Definition 15.3.5. A distribution on an acyclic directed graph is locally
Markov if


(14) $v \perp\!\!\!\perp [\mathrm{nd}(v) \setminus \mathrm{pa}(v)] \mid \mathrm{pa}(v)$.


In the definition of locally Markov the conditioning is only on the parents
of v, but in the definition of pairwise Markov the conditioning is on all of the
other nondescendants. These features correspond to Definitions 15.2.1 and
15.2.3 for undirected graphs.

In Figure 15.2, we have $1 \perp\!\!\!\perp 2, 5$; $3 \perp\!\!\!\perp 5 \mid 1, 2$; $4 \perp\!\!\!\perp 1, 2, 5 \mid 3$; and $5 \perp\!\!\!\perp 1, 3, 4 \mid 2$.
In an undirected graph constructed by replacing arrows in Figure 15.2 by
lines (directed edges by undirected edges), a locally Markov distribution on
the graph would include the conditional independences $1 \perp\!\!\!\perp 2 \mid 3$; $1, 2 \perp\!\!\!\perp 4 \mid 3$; and
$1, 3, 4 \perp\!\!\!\perp 5 \mid 2$. In the interpretation of the arrow indicating time sequence $X_3$
relates to the future of $(X_1, X_2)$; the future cannot be conditioned on.

As another example, consider an autoregressive time series $y_0, y_1, \ldots, y_T$
defined by


(15) $y_t = \rho y_{t-1} + u_t, \qquad t = 1, 2, \ldots, T$,


where $u_1, \ldots, u_T$ are independent $N(0, \sigma^2)$ variables and $y_0$ has distribution
$N[0, \sigma^2/(1 - \rho^2)]$. In this case given $y_t$, the future $y_{t+1}, \ldots, y_T$ is
independent of the past $y_0, \ldots, y_{t-1}$.


Theorem 15.3.1. A locally Markov distribution on an acyclic directed graph
is pairwise Markov.


Proof. The proof is the same as the proof of Theorem 15.2.3 for
undirected graphs. ∎


Theorem 15.3.2. A pairwise Markov distribution on an acyclic directed
graph is locally Markov.


Proof. The proof is the same as the proof of Theorem 15.2.4. ∎


Another Markov property is based on numbering the vertices in an order 
reflecting the direction of the action or the partial ordering induced. 


Definition 15.3.6. An enumeration of the elements of V is called
well-numbered if $i < j \Rightarrow v_j \not\le v_i$, or equivalently $v_j \le v_i \Rightarrow j \le i$.


Theorem 15.3.3. A finite ordered set $(V, \le)$ admits at least one
well-numbering.


Definition 15.3.7. An element $a^* \in V$ is maximal (or terminal) if $a^* \le b$
$\Rightarrow a^* = b$.


Lemma 15.3.1. A finite, partially ordered set $(V, \le)$ has at least one
maximal element $a^*$.


Proof of Lemma. The proof is by induction with $a^* = a$ if #(V) = 1.
Assume the lemma holds for #(V) = n, and consider #(V) = n + 1. Then
$V = a \cup (V \setminus a)$ for any $a \in V$. Since $\#(V \setminus a) = n$, $V \setminus a$ has a maximal
element, say $\bar{a}$. Then either $\bar{a} \le a$ and so a is maximal, or $\bar{a} \not\le a$ and so $\bar{a}$ is
maximal. ∎

Proof of Theorem 15.3.3. We shall construct a well-numbering. Let $v^*$ be a
maximal element; define $v_n = v^*$. In $V \setminus v_n$ let $v^{**}$ be a maximal element;
define $v_{n-1} = v^{**}$. At the jth stage let $v^{***}$ be a maximal element in
$V \setminus (v_n, \ldots, v_{n-j+2})$; define $v_{n-j+1} = v^{***}$, j = 3, ..., n - 1. Then $v_1 =
V \setminus (v_n, \ldots, v_2)$. This construction satisfies Definition 15.3.6. ∎


The well-numbering of V as $v^{(1)}, \ldots, v^{(n)}$ implies that in any directed path
$u = v^{(i_0)} \to v^{(i_1)} \to \cdots \to v^{(i_m)} = v$ the indices satisfy $i_0 < i_1 < \cdots < i_m$. The
well-numbering is not necessarily unique. Since V is finite, a maximal
element can be found by comparing $v_i$ and $v_j$ for at most n(n - 1)/2 pairs.
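
The construction in the proof of Theorem 15.3.3 is easy to carry out by machine: repeatedly remove a maximal element of what remains and number the vertices from n downward. The Python sketch below is an illustration under assumptions made here (the name `well_numbering` and the edge-list input). Because only maximal elements are removed, a remaining vertex with no child among the remaining vertices has no descendant among them either, so checking children is enough.

```python
def well_numbering(vertices, edges):
    """Return the vertices in a well-numbered order (v_1, ..., v_n).

    `edges` is a list of pairs (u, w) meaning u -> w in an acyclic graph.
    """
    remaining = set(vertices)
    reversed_order = []                      # collects v_n, v_{n-1}, ...
    while remaining:
        # A maximal element has no child among the remaining vertices.
        maximal = next(
            v for v in sorted(remaining)
            if not any(u == v and w in remaining for (u, w) in edges)
        )
        reversed_order.append(maximal)
        remaining.remove(maximal)
    return reversed_order[::-1]

# Graph of Figure 15.2.
edges = [(1, 3), (2, 3), (3, 4), (2, 5)]
order = well_numbering([1, 2, 3, 4, 5], edges)
position = {v: i for i, v in enumerate(order)}
print(order)
print(all(position[u] < position[w] for (u, w) in edges))
```

The second printed line confirms that every arrow points from a lower to a higher index. Choosing a different maximal element at any stage gives another valid well-numbering, in keeping with the remark that the numbering is not unique.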


Definition 15.3.8. Let $\{v_1, \ldots, v_n\}$ be a well-numbering of the acyclic
directed graph G. A distribution on G is well-numbered Markov with respect to
this well-numbering if


(16) $v_i \perp\!\!\!\perp [(v_1, \ldots, v_{i-1}) \setminus \mathrm{pa}(v_i)] \mid \mathrm{pa}(v_i), \qquad i = 3, \ldots, n$.


Apparently the definition depends on the choice of well-numbering, but
this is not the case, by Theorem 15.3.4.


Theorem 15.3.4. A distribution on an acyclic directed graph that is
well-numbered Markov is locally Markov.


Proof. $(v_1, \ldots, v_{i-1}) \subset \mathrm{nd}(v_i) \cup \mathrm{pa}(v_i)$. ∎


The definition of the global Markov property depends on relating the 
directed graph to a corresponding undirected graph. 


Definition 15.3.9. The moral graph $G^m$ of an acyclic directed graph G =
(V, E) is the undirected graph constructed by adding (undirected) edges between
parents of each vertex $v \in V$ and replacing every directed edge by an undirected
edge.


In the jargon of graph theory, the parents of a vertex are "married."


Definition 15.3.10. A distribution on an acyclic directed graph is globally
Markov if $A \perp\!\!\!\perp B \mid S$ for every A, B, and S such that S separates A and B in
$[G_{\mathrm{An}(A \cup B \cup S)}]^m$.
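
Definitions 15.3.9 and 15.3.10 together give a recipe: restrict the graph to An(A ∪ B ∪ S), moralize it, and test separation in the resulting undirected graph. The Python sketch below follows that recipe for the graph of Figure 15.2. The function names (`ancestral_set`, `moral_graph`, `separated_in_moral_graph`) and the edge-list input are assumptions introduced for illustration.

```python
from collections import deque

def ancestral_set(edges, nodes):
    """An(nodes): the nodes together with all of their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for (u, w) in edges:
            if w == v and u not in result:
                result.add(u)
                stack.append(u)
    return result

def moral_graph(vertices, edges):
    """Moral graph of the subgraph induced by `vertices`."""
    vertices = set(vertices)
    kept = [(u, w) for (u, w) in edges if u in vertices and w in vertices]
    adj = {v: set() for v in vertices}
    for u, w in kept:                 # drop the directions
        adj[u].add(w)
        adj[w].add(u)
    for v in vertices:                # marry the parents of each vertex
        parents = [u for (u, w) in kept if w == v]
        for p in parents:
            for q in parents:
                if p != q:
                    adj[p].add(q)
    return adj

def separated_in_moral_graph(edges, A, B, S):
    """True if S separates A and B in the moral graph of G_An(A u B u S)."""
    A, B, S = set(A), set(B), set(S)
    adj = moral_graph(ancestral_set(edges, A | B | S), edges)
    seen, queue = set(A), deque(A)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w in S or w in seen:
                continue
            if w in B:
                return False
            seen.add(w)
            queue.append(w)
    return True

edges = [(1, 3), (2, 3), (3, 4), (2, 5)]  # Figure 15.2
print(separated_in_moral_graph(edges, {1}, {2}, set()))
print(separated_in_moral_graph(edges, {1}, {2}, {3}))
print(separated_in_moral_graph(edges, {4}, {1, 2, 5}, {3}))
```

In this example vertices 1 and 2 are separated by the empty set, but not by 3: once 3 is included, its parents 1 and 2 are married in the moral graph. Vertex 3 separates 4 from (1, 2, 5).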


Theorem 15.3.5. A distribution on an acyclic directed graph that is globally
Markov is locally Markov.

Proof. For any $v \in V$ let pa(v) = S in the definition of globally Markov.
Let v = A and $\mathrm{nd}(v) \setminus \mathrm{pa}(v) = B$. A vertex $w \in \mathrm{nd}(v) \setminus \mathrm{pa}(v)$ is a vertex in
$\mathrm{An}(A \cup B \cup S)$. Let $\pi = (w = v_0, v_1, \ldots, v_n = v)$ be a path from w to v in
$[G_{\mathrm{An}(A \cup B \cup S)}]^m = (G_{\mathrm{Nd}(v)})^m$. If $(v_{n-1}, v_n)$ corresponds to a directed edge
$(v_{n-1} \to v_n)$ in $G_{\mathrm{Nd}(v)}$, then $v_{n-1} \in \mathrm{pa}(v) = S$ and pa(v) separates
$\mathrm{nd}(v) \setminus \mathrm{pa}(v)$ and v. [The directed edge $(v_{n-1} \leftarrow v_n)$ implies $v_{n-1} \in \mathrm{de}(v)$.] ∎


Theorem 15.3.6. A distribution on an acyclic directed graph that is locally 
Markov is globally Markov. 


The proof is very lengthy and is omitted. 


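Recursive Factorization
The recursive aspect of the acyclic directed graph permits a systematic
factorization of the density. Use the construction of Theorem 15.3.3. Let
n = |V|; then $v_n$ is a maximal element of V. Then


(17) $X_{V \setminus [v_n \cup \mathrm{pa}(v_n)]} \perp\!\!\!\perp X_{v_n} \mid \mathrm{pa}(v_n)$.


Thus (in the normal case)


(18) $\mathscr{E}[X_{v_n} \mid \mathrm{pa}(v_n)] = \alpha_n + B_n X_{\mathrm{pa}(v_n)}$,

(19) $\mathscr{E}\{(X_{v_n} - \mathscr{E}[X_{v_n} \mid \mathrm{pa}(v_n)])(X_{v_n} - \mathscr{E}[X_{v_n} \mid \mathrm{pa}(v_n)])'\} = \Sigma_n$.

At the jth step let $v_{n-j+1}$ be a maximal element of $V \setminus (v_n, \ldots, v_{n-j+2})$. Then
(20) $X_{V \setminus [v_n, \ldots, v_{n-j+1}, \mathrm{pa}(v_{n-j+1})]} \perp\!\!\!\perp X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})$.

Thus

(21) $\mathscr{E}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})] = \alpha_{n-j+1} + B_{n-j+1} X_{\mathrm{pa}(v_{n-j+1})}$,
(22) $\mathscr{E}\{(X_{v_{n-j+1}} - \mathscr{E}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})])(X_{v_{n-j+1}} - \mathscr{E}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})])'\} = \Sigma_{n-j+1}$.


The vector $X_{v_{n-j+1}} - \mathscr{E}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})]$ is independent of $\mathrm{pa}(v_{n-j+1})$. The relations (18) to (22)
can be written as generating equations. Let


(23) $x_1 = \alpha_1 + \varepsilon_1$,
(24) $x_2 = \alpha_2 + B_{21} x_1 + \varepsilon_2$,
(25) $x_{n-1} = \alpha_{n-1} + B_{n-1} (x_1', \ldots, x_{n-2}')' + \varepsilon_{n-1}$,


(26) $x_n = \alpha_n + B_n (x_1', \ldots, x_{n-1}')' + \varepsilon_n$,
where $\varepsilon_1, \ldots, \varepsilon_n$ are independent random vectors with $\mathscr{E} \varepsilon_j \varepsilon_j' = \Sigma_j$. In matrix
form (23) to (26) are
(27) $B x = \alpha + \varepsilon$,
where
(28) $\alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_n \end{pmatrix}, \qquad B = \begin{pmatrix} I & 0 & 0 & \cdots & 0 \\ -B_{21} & I & 0 & \cdots & 0 \\ -B_{31} & -B_{32} & I & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ -B_{n1} & -B_{n2} & -B_{n3} & \cdots & I \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix}$,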

and B,=0 if i<j—k,. Because the determinant of B is 1, (27) can be 
solved for ' 


(29) $x = B^{-1} \alpha + B^{-1} \varepsilon$.


The matrix $B^{-1}$ is also lower triangular.


15.4. CHAIN GRAPHS 


A chain graph includes both directed and undirected edges; however, only
certain patterns of vertices and edges are permitted. Suppose the set of
vertices V of the graph G = (V, E) can be partitioned into subsets $V =
V(1) \cup \cdots \cup V(T)$ so that within a subset the vertices are joined by
undirected edges and directed edges join vertices in different subsets. Let $\mathcal{T}(G)$
be the set of vertices 1, ..., T and let $\mathcal{E}(G)$ be the (directed) edge set such
that $\tau \to \sigma$ if and only if there is at least one element $u \in V(\tau)$ and at least
one element $v \in V(\sigma)$ such that $u \to v$ is in E, the edge set of G. Then
$\mathcal{D}(G) = [\mathcal{T}(G), \mathcal{E}(G)]$ is an acyclic directed graph; we can define $\mathrm{pa}_{\mathcal{D}}(\tau)$,
etc., for $\mathcal{D}(G)$.

Let $X_\tau = (X_u \mid u \in V(\tau))$. Within a set the vertices form an undirected
graph relative to the probability distribution conditional on the past (that is,
earlier sets). See Figure 15.4 [Lauritzen (1996)] and Figure 15.5.

We now define the Markov properties as specified by Lauritzen and 
Wermuth (1989) and Frydenberg (1990): 


(C1) The distribution of $X_\tau$, $\tau = 1, \ldots, T$, is locally Markov with respect to
the acyclic directed graph $\mathcal{D}(G)$; that is,


(1) $X_\tau \perp\!\!\!\perp X_\sigma \mid X_{\mathrm{pa}_{\mathcal{D}}(\tau)}, \qquad \sigma \in \mathrm{nd}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_{\mathcal{D}}(\tau)$.

(C2) For each $\tau$ the conditional distribution of $X_\tau$ given $X_{\mathrm{pa}_{\mathcal{D}}(\tau)}$ is globally
Markov with respect to the undirected graph on $V(\tau)$.

(C3)
(2) $X_u \perp\!\!\!\perp X_v \mid X_{\mathrm{bd}_G(U)}, \qquad u \in U \subset V(\tau), \quad v \in \mathrm{pa}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_G(U)$.


[Figure 15.4. A chain graph.]

[Figure 15.5. The corresponding induced acyclic directed graph on $V = V(1) \cup V(2) \cup V(3)$.]


Here $\mathrm{bd}_G(U) = \mathrm{pa}_G(U) \cup \mathrm{nb}_G(U)$. A distribution on the chain graph G that
satisfies (C1), (C2), (C3) is LWF block recursive Markov.

In Figure 15.6 $\mathrm{pa}_{\mathcal{D}}(\tau) = \{\tau - 1, \tau - 2\}$ and $\mathrm{nd}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_{\mathcal{D}}(\tau) = \{\tau - 3, \tau - 4, \ldots, 1\}$.
The set U = (u, w) is a set in $V(\tau)$, and $\mathrm{pa}_G(U)$ is the set in
$V(\tau - 1) \cup V(\tau - 2)$ that includes $\mathrm{pa}_G(u)$ for $u \in U$; that is, $\mathrm{pa}_G(U) = \{x, y\}$.


[Figure 15.6. A chain graph.]

[Figure 15.7. A chain graph.]


Andersson, Madigan, and Perlman (2001) have proposed an alternative
Markov property (AMP), replacing (C3) by


(C3*)
(3) $X_u \perp\!\!\!\perp X_v \mid X_{\mathrm{pa}_G(U)}, \qquad u \in U \subset V(\tau), \quad v \in \mathrm{pa}_{\mathcal{D}}(\tau) \setminus \mathrm{pa}_G(U)$.


In Figure 15.6, $X_v$ for a vertex v in $V(\tau - 2) \cup V(\tau - 1)$ is conditionally
independent of $X_u$ [$u \in U \subset V(\tau)$] when regressed on $X_{\mathrm{pa}_G(U)} = (X_x, X_y)$.
The difference between (C3) and (C3*) is that the conditioning in (C3) is on
$\mathrm{bd}_G(U) = \mathrm{pa}_G(U) \cup \mathrm{nb}_G(U)$, but the conditioning in (C3*) is on $\mathrm{pa}_G(U)$
only. See Figure 15.6. The conditioning in (C3*) is on variables in the past.
Figure 15.7 [Andersson, Madigan, and Perlman (2001)] illustrates the
difference between the LWF and AMP Markov properties:


(4) LWF: $X_1 \perp\!\!\!\perp X_4 \mid X_2, X_3, \qquad X_2 \perp\!\!\!\perp X_3 \mid X_1, X_4$,


(5) AMP: $X_1 \perp\!\!\!\perp X_4 \mid X_2, \qquad X_2 \perp\!\!\!\perp X_3 \mid X_1$.


Note that in (5) $X_1$ and $X_4$ are conditionally independent given $X_2$; the
conditional distribution of $X_4$ depends on $\mathrm{pa}(v_4)$, but not $X_1$.

The AMP specification allows a block recursive equation formulation.
In the example in Figure 15.7 the distribution of scalars $X_1$ and $X_2$
[$v_1, v_2 \in V(1)$] can be specified as


(6) $X_1 = \varepsilon_1$,
(7) $X_2 = \varepsilon_2$,


where $(\varepsilon_1, \varepsilon_2)$ has an arbitrary (normal) distribution. Since $X_3$ depends
directly on $X_1$ and $X_4$ depends directly on $X_2$, we write


(8) $X_3 = \beta_{31} X_1 + \varepsilon_3$,


(9) $X_4 = \beta_{42} X_2 + \varepsilon_4$,


where $(\varepsilon_3, \varepsilon_4)$ has an arbitrary distribution independent of $(\varepsilon_1, \varepsilon_2)$, and
hence independent of $(X_1, X_2)$.
In general the AMP model can be expressed as (26) of Section 15.3.


15.5. STATISTICAL INFERENCE 


15.5.1. Normal Distribution 


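Let $x_1, \ldots, x_N$ be N observations on X with distribution $N(\mu, \Sigma)$. Let $\bar{x} =
N^{-1} \sum_{\alpha=1}^N x_\alpha$ and $S = (N-1)^{-1} \sum_{\alpha=1}^N (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (N-1)^{-1} [\sum_{\alpha=1}^N x_\alpha x_\alpha'
- N \bar{x} \bar{x}']$. The likelihood is


(1) $(2\pi)^{-pN/2} |\Sigma|^{-N/2} \exp\left[-\tfrac{1}{2} \sum_{\alpha=1}^N (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu)\right]$


$= (2\pi)^{-pN/2} |\Sigma|^{-N/2} \exp\left\{-\tfrac{1}{2} \left[N (\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) + (N-1)\, \mathrm{tr}\, \Sigma^{-1} S\right]\right\}$.


The above form shows that $\bar{x}$ and S are a pair of sufficient statistics for $\mu$
and $\Sigma$, and they are independently distributed. The interest in this chapter is
on the dependences, which depend only on the covariance matrix $\Sigma$, not $\mu$.
For the rest of this chapter we shall suppress the mean. Accordingly, we
suppose that the parent distribution is $N(0, \Sigma)$ and the sample is $x_1, \ldots, x_n$,
and $S = (1/n) \sum_{\alpha=1}^n x_\alpha x_\alpha'$. The likelihood function can be written


(2) $(2\pi)^{-pn/2} |\Lambda|^{n/2} e^{-\frac{1}{2} \mathrm{tr}\, \Lambda T} = \exp\left[-\psi(\Lambda) - \tfrac{1}{2} \sum_{i=1}^p \lambda_{ii} t_{ii} - \sum_{i<j} \lambda_{ij} t_{ij}\right]$,


where $\Lambda = (\lambda_{ij}) = \Sigma^{-1}$, $T = (t_{ij}) = \sum_{\alpha=1}^n x_\alpha x_\alpha'$, and $\psi(\Lambda) = \tfrac{1}{2} pn \log(2\pi)
- \tfrac{1}{2} n \log|\Lambda|$.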

The likelihood is in the exponential family with canonical parameter $\Lambda$
and statistic T. The maximum likelihood estimator of $\Sigma$ with no restriction is
$\hat{\Sigma} = S = (1/n) T$. Since $\Lambda = \Sigma^{-1}$ is a 1-to-1 transformation of $\Sigma$, the
maximum likelihood estimator of $\Lambda$ is $\hat{\Lambda} = S^{-1}$.


15.5.2. Covariance Selection Models


In undirected graphs many of the models involve zero restrictions on
elements of $\Lambda$. Dempster (1972) studied such models and introduced the term
covariance selection. When the (undirected) graph satisfies the pairwise Markov
condition, $\lambda_{ij} = 0$ for $(i, j) \notin E$. We assume here that the graph satisfies this
Markov condition. Further we assume n > p; then S is positive definite with
probability 1.

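The likelihood function is


(3) $(2\pi)^{-pn/2} |\Lambda|^{n/2} \exp\left[-n \left(\tfrac{1}{2} \sum_{i=1}^p \lambda_{ii} s_{ii} + \sum_{(i,j) \in E} \lambda_{ij} s_{ij}\right)\right]$,


where $\Lambda$ satisfies the condition $\lambda_{ij} = 0$, $(i, j) \notin E$. In this form the canonical
parameters are $\lambda_{11}, \ldots, \lambda_{pp}$ and $\lambda_{ij}$, $(i, j) \in E$. The canonical variables are
$s_{11}, \ldots, s_{pp}$ and $s_{ij}$, $(i, j) \in E$; these form a sufficient set of statistics. To
maximize the likelihood function we differentiate (3) with respect to $\lambda_{ii}$,
i = 1, ..., p, and $\lambda_{ij}$, $(i, j) \in E$, to obtain the equations (4) and (5).

Theorem 15.5.1. The maximum likelihood estimator of $\Sigma$ in the model (3)
is given by
(4) $\hat{\sigma}_{ij} = s_{ij}, \qquad i = j$ or $(i, j) \in E$,
(5) $\hat{\lambda}_{ij} = 0, \qquad i \ne j$ and $(i, j) \notin E$,
where $\hat{\Lambda} = \hat{\Sigma}^{-1}$.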

This result follows from the general theory of exponential families. See 
Lauritzen (1996), Theorem 5.3 and Appendix D.1. 

Here we shall show that for a decomposable graph the equations (4) and
(5) have a unique positive definite solution by developing an algorithm for its
computation. We follow Speed and Kiiveri (1986).


Theorem 15.5.2. Let L and M be p × p positive definite matrices. There
exists a unique positive definite matrix K such that


(6) $k_{ij} = l_{ij}, \qquad i = j$ or $(i, j) \in E$,
(7) $k^{ij} = m^{ij}, \qquad i \ne j$ and $(i, j) \notin E$,
where $(k^{ij}) = K^{-1}$ and $(m^{ij}) = M^{-1}$.

The proof of Theorem 15.5.2 depends on several lemmas. In the maximum
likelihood estimation L = S, M = I or any other diagonal matrix, and $K = \hat{\Sigma}$.
To develop this subject we use the Kullback information. For a pair of
multivariate normal distributions N(0, P) and N(0, R) define


(8) $I(P|R) = \mathscr{E}_P \log \dfrac{n(x \mid 0, P)}{n(x \mid 0, R)} = -\tfrac{1}{2} \left[\log|P R^{-1}| + \mathrm{tr}(I - P R^{-1})\right]$.
Lemma 15.5.1. Suppose P and R are positive definite. Then:


(i) $I(P|R) > 0$, $P \ne R$, and $I(P|P) = 0$.


(ii) If $\{P_n\}$ and $\{R_n\}$ are sequences of positive definite matrices such that
$I(P_n|R_n) \to 0$, then $P_n R_n^{-1} \to I$.


Proof. (i) Let the roots of $|P - sR| = 0$ be $s_1 \le \cdots \le s_p$. Then


(9) $\log|P R^{-1}| + \mathrm{tr}(I - P R^{-1}) = \sum_{i=1}^p (\log s_i + 1 - s_i) \le 0$,


and (9) is 0 if and only if $s_1 = \cdots = s_p = 1$.
(ii) Let the roots of $|P_n - s R_n| = 0$ be $s_1(n) \le \cdots \le s_p(n)$. Then $I(P_n|R_n) \to
0$ implies $[s_1(n), \ldots, s_p(n)] \to (1, \ldots, 1)$, which implies that $P_n R_n^{-1} \to I$. ∎


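Lemma 15.5.2. Let


(10) $P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}, \qquad R = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}$.
Then
(i) The matrix
(11) $Q = \begin{pmatrix} P_{11} & P_{11} R_{11}^{-1} R_{12} \\ R_{21} R_{11}^{-1} P_{11} & R_{22} - R_{21} R_{11}^{-1} R_{12} + R_{21} R_{11}^{-1} P_{11} R_{11}^{-1} R_{12} \end{pmatrix}$


satisfies $Q_{11} = P_{11}$, $Q^{12} = R^{12}$, and $Q^{22} = R^{22}$, where


(12) $Q^{-1} = \begin{pmatrix} P_{11}^{-1} + R^{12} (R^{22})^{-1} R^{21} & R^{12} \\ R^{21} & R^{22} \end{pmatrix} = \begin{pmatrix} P_{11}^{-1} - R_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + R^{-1}$.


(ii) $I(P|R) = I(P|Q) + I(Q|R)$.

Proof. (i) Let


(13) $Q^{-1} = \begin{pmatrix} S & R^{12} \\ R^{21} & R^{22} \end{pmatrix}$.


Then $I = Q^{-1} Q$ can be solved for $S = P_{11}^{-1} + R^{12} (R^{22})^{-1} R^{21}$; $Q = (Q^{-1})^{-1}$
follows from Theorem A.3.3. Then (ii) follows from


(14) $P Q^{-1} = P R^{-1} + \begin{pmatrix} I - P_{11} R_{11}^{-1} & 0 \\ P_{21} (P_{11}^{-1} - R_{11}^{-1}) & 0 \end{pmatrix}, \qquad Q R^{-1} = \begin{pmatrix} P_{11} R_{11}^{-1} & 0 \\ R_{21} R_{11}^{-1} (P_{11} R_{11}^{-1} - I) & I \end{pmatrix}$,


and $|P Q^{-1}| \cdot |Q R^{-1}| = |P R^{-1}|$. From (13) and (14)
(15) $\mathrm{tr}\, P Q^{-1} + \mathrm{tr}\, Q R^{-1} = \mathrm{tr}\, P R^{-1} + \mathrm{tr}\, I$. ∎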


Lemma 15.5.2 provides the solution to the problem of finding a matrix Q,
given positive definite matrices P and R, such that


(16) $q_{ij} = p_{ij}, \qquad (i, j) \in (1, \ldots, t)$,
(17) $q^{ij} = r^{ij}, \qquad (i, j) \notin (1, \ldots, t)$.

We now develop an iterative method to find K to satisfy (6) and (7), thus
proving Theorem 15.5.2. Suppose $E = c_1 \cup \cdots \cup c_m$, where $c_1, \ldots, c_m$ are the
m cliques of a decomposable graph G = (V, E). Let $K_0^{-1} = M^{-1}$. Define
recursively $K_n = (k_{ij}(n))$ such that

(18) $k_{ij}(n) = l_{ij}, \qquad i, j \in c_{n \bmod m}$,


(19) $k^{ij}(n) = k^{ij}(n - 1), \qquad i, j \notin c_{n \bmod m}$.

By Lemma 15.5.2, $K_n$ is uniquely determined. (The algorithm cycles through
the cliques.) By construction


(20) $I(L|K_{n-1}) = I(L|K_n) + I(K_n|K_{n-1})$.


Summation of (20) from 1 to q gives
(21) $I(L|K_0) = I(L|K_q) + \sum_{j=1}^q I(K_j|K_{j-1})$.


Since $I(L|K_q) \ge 0$, $\sum_{j=1}^q I(K_j|K_{j-1})$ is bounded and $I(K_n|K_{n-1}) \to 0$ as $n \to \infty$.
The set $\{K^{-1} \mid I(L|K) \le I(L|K_0)\}$ is strictly convex.

Consider the vector sequence $(K_{rm+1}, \ldots, K_{rm+m})$ with index r (n = rm).
It has a convergent subsequence $\{r(a)\}$; that is, $(K_{mr(a)+1}, \ldots, K_{mr(a)+m})$
converges to $(K_1^*, \ldots, K_m^*)$, say. Since $I(K_n|K_{n-1}) \to 0$, $K_n K_{n-1}^{-1} \to I$. Then the
matrix $K_{mr(a)+j} K_{mr(a)+j-1}^{-1} \to I$, j = 2, ..., m, which implies $K_1^* = \cdots = K_m^* =
K^*$, say. Note that $(i, j \mid i, j \notin E)$ satisfies $i, j \notin c_i$, i = 1, ..., m. Hence $K_n$
satisfies (7), n = 0, 1, ..., and $K^*$ does too. Further, $K_{mr(a)+t}$ satisfies
(18) for $i, j \in c_t$ and $K^*$ does, too. Figure 15.8 diagrams the sets for $c_1 = \{(i, j)\}$,
i, j = 1, ..., t, and $c_2 = \{(i, j)\}$, i, j = u, u + 1, ..., p, u < t.

[Figure 15.8. Diagram of $c_1$ and $c_2$ for $c_1 \cup c_2 = E$.]

The procedure allows for construction of a multivariate normal
distribution with arbitrary marginal distributions over the cliques $c_1, \ldots, c_m$, provided
that the specified marginal distributions are consistent.

Theorem 15.5.2 provides a proof of the existence and uniqueness of the
maximum likelihood estimators.

The equation (12) is an updating equation. When $Q^{-1} = K_n^{-1}$ and $R^{-1} =
K_{n-1}^{-1}$, the entries in $K_{n-1}^{-1}$ not in $c_{n \bmod m}$ remain unchanged.

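Dempster (1972) also proposes some iterative methods for finding K
satisfying $k_{ij} = l_{ij}$, $(i, j) \in D$, and $k^{ij} = m^{ij}$, $(i, j) \notin D$. The entropy of
$n(x \mid 0, P)$ is


(22) $\mathscr{E}_P \log n(x \mid 0, P) = -\tfrac{1}{2} (p \log 2\pi + p + \log|P|)$.


Note that $|\Sigma| = \prod_{i=1}^p \sigma_{ii} \cdot |R|$, where $R = (\rho_{ij})$. Given that $\sigma_{ii} = s_{ii}$, the
selection of $\rho_{ij}$ to maximize the entropy of the fitted normal distribution satisfying
the requirements also minimizes |R| [Dempster (1972)].


15.5.3. Decomposition of Covariance Selection Models


An undirected graph is decomposable if the graph is formed by three disjoint
sets A, B, C, where $V = A \cup B \cup C$, A and B are nonempty, C separates A
and B, and C is complete. Then if $X_V$ is globally Markov with respect to G,
we have $X_A \perp\!\!\!\perp X_B \mid X_C$,


(23) $\Sigma^{-1} = \Lambda = \begin{pmatrix} \Lambda_{AA} & 0 & \Lambda_{AC} \\ 0 & \Lambda_{BB} & \Lambda_{BC} \\ \Lambda_{CA} & \Lambda_{CB} & \Lambda_{CC} \end{pmatrix}$,


(24) $\Sigma_{(AB)\cdot C} = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} - \begin{pmatrix} \Sigma_{AC} \\ \Sigma_{BC} \end{pmatrix} \Sigma_{CC}^{-1} (\Sigma_{CA}, \Sigma_{CB})$,
and
(25) $\Sigma_{AB} - \Sigma_{AC} \Sigma_{CC}^{-1} \Sigma_{CB} = 0$.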

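The maximum likelihood estimator of $\Sigma$ can be constructed from the
maximum likelihood estimators of $\Sigma_{AA\cdot C}$, $\Sigma_{AB\cdot C}$, $\Sigma_{BB\cdot C}$, $B_{(AB)\cdot C}$, and $\Sigma_{CC}$.
If there is no restriction on $\Sigma$, the maximum likelihood estimator of $\Sigma$ is


(26) $\hat{\Sigma}_\Omega = \begin{pmatrix} S_{(AB)\cdot C} + S_{(AB)C} S_{CC}^{-1} S_{C(AB)} & S_{(AB)C} \\ S_{C(AB)} & S_{CC} \end{pmatrix} = S$,


where
(27) $S_{(AB)\cdot C} = \begin{pmatrix} S_{AA\cdot C} & S_{AB\cdot C} \\ S_{BA\cdot C} & S_{BB\cdot C} \end{pmatrix} = \begin{pmatrix} S_{AA} & S_{AB} \\ S_{BA} & S_{BB} \end{pmatrix} - \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} S_{CC}^{-1} (S_{CA}, S_{CB})$.


If the restriction (25) is imposed, the maximum likelihood estimator is (26)
with $S_{AB\cdot C}$ replaced by 0 to obtain


(28) $\hat{\Sigma}_\omega = \begin{pmatrix} \begin{pmatrix} S_{AA\cdot C} & 0 \\ 0 & S_{BB\cdot C} \end{pmatrix} + \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} S_{CC}^{-1} (S_{CA}, S_{CB}) & \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} \\ (S_{CA}, S_{CB}) & S_{CC} \end{pmatrix}$.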
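
A small numerical check may make (28) concrete. In the Python sketch below, A, B, and C each consist of one scalar variable, so the restricted estimate agrees with S except that the (A, B) element becomes $s_{AC} s_{CC}^{-1} s_{CB}$. The matrix S is a made-up positive definite example, and the helper names `restricted_estimate` and `inverse3` are assumptions introduced here. Exact rational arithmetic is used so that the zero in the concentration matrix appears exactly.

```python
from fractions import Fraction

def restricted_estimate(S):
    """Copy S, replacing the (A, B) entry by S_AC * S_CC^{-1} * S_CB as in (28)."""
    a, b, c = 0, 1, 2  # positions of the scalar variables X_A, X_B, X_C
    sigma = [row[:] for row in S]
    sigma[a][b] = sigma[b][a] = S[a][c] * S[c][b] / S[c][c]
    return sigma

def inverse3(M):
    """Invert a 3 x 3 matrix by the adjugate formula."""
    def cofactor(i, j):
        r = [k for k in range(3) if k != i]
        c = [k for k in range(3) if k != j]
        minor = M[r[0]][c[0]] * M[r[1]][c[1]] - M[r[0]][c[1]] * M[r[1]][c[0]]
        return (-1) ** (i + j) * minor
    det = sum(M[0][j] * cofactor(0, j) for j in range(3))
    return [[cofactor(j, i) / det for j in range(3)] for i in range(3)]

# Hypothetical sample covariance matrix, ordered as (A, B, C).
S = [[Fraction(v) for v in row] for row in [[4, 2, 1], [2, 3, 1], [1, 1, 2]]]
sigma_hat = restricted_estimate(S)
lambda_hat = inverse3(sigma_hat)
print(sigma_hat[0][1])   # fitted covariance of X_A and X_B
print(lambda_hat[0][1])  # concentration element for the missing edge (A, B)
```

The fitted matrix keeps every element of S that corresponds to a vertex or an edge, in agreement with (4) of Theorem 15.5.1, and its inverse has a zero in the (A, B) position, in agreement with (5).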


The matrix $S_{(AB)\cdot C}$ has the Wishart distribution $W[\Sigma_{(AB)\cdot C}, n - (p_A + p_B)]$,
where $p_A$ and $p_B$ are the numbers of components of $X_A$ and $X_B$, respectively
(Section 8.3). The matrix $B_{(AB)\cdot C} = S_{(AB)C} S_{CC}^{-1}$ conditional on $(x_{C1}, \ldots, x_{Cn})
= X_C$ has a normal distribution, the covariance of which is given by


(29) $\mathrm{Cov}\left[\mathrm{vec} \begin{pmatrix} B_{AC} \\ B_{BC} \end{pmatrix} \Big|\, X_C\right] = S_{CC}^{-1} \otimes \begin{pmatrix} \Sigma_{AA\cdot C} & 0 \\ 0 & \Sigma_{BB\cdot C} \end{pmatrix}$,


and $S_{CC}$ has the Wishart distribution $W(\Sigma_{CC}, n)$. The matrix $S_{(AB)\cdot C}$ and the
matrix $B_{(AB)\cdot C}$ are independent (Chapter 8).

Consider testing the null hypothesis (25). This is testing the null hypothesis
$\Sigma_{AB\cdot C} = 0$ against the alternative $\Sigma_{AB\cdot C} \ne 0$. The determinant of (26) is
$|\hat{\Sigma}_\Omega| = |S_{(AB)\cdot C}| \cdot |S_{CC}|$; the determinant of (28) is $|\hat{\Sigma}_\omega| = |S_{AA\cdot C}| \cdot |S_{BB\cdot C}| \cdot |S_{CC}|$.
The likelihood ratio criterion is


(30) $\left(\dfrac{|\hat{\Sigma}_\Omega|}{|\hat{\Sigma}_\omega|}\right)^{n/2} = \left(\dfrac{|S_{(AB)\cdot C}|}{|S_{AA\cdot C}| \cdot |S_{BB\cdot C}|}\right)^{n/2}$.


Since the sample covariance matrix $S_{(AB)\cdot C}$ has the Wishart distribution
$W[\Sigma_{(AB)\cdot C}, n - (p_A + p_B)]$, where $p_A$ and $p_B$ are the numbers of components
of $X_A$ and $X_B$ (Section 8.2), the criterion is, in effect, $U_{p_A, p_B, n - (p_A + p_B)}$,
studied in Sections 8.4 and 8.5.

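As another example, consider the graph in Figure 15.9. Note that node 4
separates (1, 2, 3) and (5, 6); nodes 1, 4 separate 2 and 3; and node 4 separates
5 and 6. These separations imply three conditional independences:
$(X_1, X_2, X_3) \perp\!\!\!\perp (X_5, X_6) \mid X_4$, $X_2 \perp\!\!\!\perp X_3 \mid (X_1, X_4)$, and $X_5 \perp\!\!\!\perp X_6 \mid X_4$. In terms of
covariances these conditional independences are


(31) $\Sigma_{(123)(56)\cdot 4} = \Sigma_{(123)(56)} - \Sigma_{(123)4} \Sigma_{44}^{-1} \Sigma_{4(56)} = 0$,
(32) $\Sigma_{23\cdot 14} = \Sigma_{23} - \Sigma_{2(14)} \Sigma_{(14)(14)}^{-1} \Sigma_{(14)3} = 0$,
(33) $\Sigma_{56\cdot 4} = \Sigma_{56} - \Sigma_{54} \Sigma_{44}^{-1} \Sigma_{46} = 0$.

[Figure 15.9. Undirected graph on the vertices 1, 2, 3, 4, 5, 6.]

In view of (31) the restriction (32) can be written as
(34) $\Sigma_{23\cdot 14} = \Sigma_{23\cdot 4} - \Sigma_{21\cdot 4} \Sigma_{11\cdot 4}^{-1} \Sigma_{13\cdot 4} = 0$.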
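It will be convenient to reorder the subvectors as $X_2, X_3, X_1, X_5, X_6, X_4$ to
write


(35) $S = \begin{pmatrix} S_{22} & \cdots & S_{26} & S_{24} \\ \vdots & & \vdots & \vdots \\ S_{62} & \cdots & S_{66} & S_{64} \\ S_{42} & \cdots & S_{46} & S_{44} \end{pmatrix}$
$= \begin{pmatrix} \begin{pmatrix} S_{22\cdot 4} & \cdots & S_{26\cdot 4} \\ \vdots & & \vdots \\ S_{62\cdot 4} & \cdots & S_{66\cdot 4} \end{pmatrix} + \begin{pmatrix} S_{24} \\ \vdots \\ S_{64} \end{pmatrix} S_{44}^{-1} (S_{42}, \ldots, S_{46}) & \begin{pmatrix} S_{24} \\ \vdots \\ S_{64} \end{pmatrix} \\ (S_{42}, \ldots, S_{46}) & S_{44} \end{pmatrix}$
$= \begin{pmatrix} S_{(2\cdots6)(2\cdots6)\cdot 4} + S_{(2\cdots6)4} S_{44}^{-1} S_{4(2\cdots6)} & S_{(2\cdots6)4} \\ S_{4(2\cdots6)} & S_{44} \end{pmatrix}$.
The determinant of S is
(36) $|S| = |S_{(2\cdots6)(2\cdots6)\cdot 4}| \cdot |S_{44}|$.


If the condition $(X_1, X_2, X_3) \perp\!\!\!\perp (X_5, X_6) \mid X_4$ is imposed, the maximum
likelihood estimator is (35) with $S_{(123)(56)\cdot 4}$ replaced by 0 to obtain


(37) $\begin{pmatrix} \begin{pmatrix} S_{(231)(231)\cdot 4} & 0 \\ 0 & S_{(56)(56)\cdot 4} \end{pmatrix} + S_{(2\cdots6)4} S_{44}^{-1} S_{4(2\cdots6)} & S_{(2\cdots6)4} \\ S_{4(2\cdots6)} & S_{44} \end{pmatrix}$.


The determinant of (37) is


(38) $|S_{(231)(231)\cdot 4}| \cdot |S_{(56)(56)\cdot 4}| \cdot |S_{44}|$.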



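The likelihood ratio criterion for $\Sigma_{(231)(56)\cdot 4} = 0$ is


(39) $\left(\dfrac{|S_{(2\cdots6)(2\cdots6)\cdot 4}|}{|S_{(231)(231)\cdot 4}| \cdot |S_{(56)(56)\cdot 4}|}\right)^{n/2} = U_{(231)(56)\cdot 4}^{n/2}$.


Here $U_{(231)(56)\cdot 4}$ has the distribution of $U_{p_1 + p_2 + p_3,\, p_5 + p_6,\, n - p}$ (Section 8.4) since
the distribution of $S_{(2\cdots6)(2\cdots6)\cdot 4}$ is $W(\Sigma_{(2\cdots6)(2\cdots6)\cdot 4}, n - p_4)$, independent
of $S_{44}$.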
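The first three rows and columns of $S_{(2\cdots6)(2\cdots6)\cdot 4}$ constitute the matrix


(40) $S_{(231)(231)\cdot 4} = \begin{pmatrix} S_{22\cdot 4} & S_{23\cdot 4} & S_{21\cdot 4} \\ S_{32\cdot 4} & S_{33\cdot 4} & S_{31\cdot 4} \\ S_{12\cdot 4} & S_{13\cdot 4} & S_{11\cdot 4} \end{pmatrix}$
$= \begin{pmatrix} \begin{pmatrix} S_{22\cdot 14} & S_{23\cdot 14} \\ S_{32\cdot 14} & S_{33\cdot 14} \end{pmatrix} + \begin{pmatrix} S_{21\cdot 4} \\ S_{31\cdot 4} \end{pmatrix} S_{11\cdot 4}^{-1} (S_{12\cdot 4}, S_{13\cdot 4}) & \begin{pmatrix} S_{21\cdot 4} \\ S_{31\cdot 4} \end{pmatrix} \\ (S_{12\cdot 4}, S_{13\cdot 4}) & S_{11\cdot 4} \end{pmatrix}$.
The determinant of (40) is
(41) $|S_{(231)(231)\cdot 4}| = |S_{(23)(23)\cdot 14}| \cdot |S_{11\cdot 4}|$.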

The estimator of $5,505, with X; Ш ХХ, X, imposed is (40) with 5,3, 
replace by 0 to obtain 


$5, 0 $54 $4.4 
р tle 151.08.453. І 
(42) | 0 S514 Sia a 812.4, 513 4) $54] |, 
(Sizs 513.4) Sua 


the determinant of which is 


(43) [552-14 |1 зз-14 | `[511-4 | 
The likelihood ratio criterion for $\Sigma_{23\cdot 14} = 0$ is

(44) $\left[\dfrac{|S_{(23)(23)\cdot 14}|}{|S_{22\cdot 14}|\cdot|S_{33\cdot 14}|}\right]^{n/2} = U_{23\cdot 14}^{n/2}$.


The statistic U.,.,, has the distribution of U,, р и-(ри+ра+рз+р4) (Section 84) 
since S,o3,23)-14 has the distribution ИХ 23)¢23)-149 n — (p, + p.)] independen 
of 5 p 
(3X03) А . " "uL 
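The estimator of $\Sigma_{(56)(56)\cdot 4}$ with $\Sigma_{56\cdot 4} = 0$ imposed is $S_{(56)(56)\cdot 4}$ with $S_{56\cdot 4}$ replaced by 0 to obtain

(45) $\begin{pmatrix} S_{55\cdot 4} & 0 \\ 0 & S_{66\cdot 4} \end{pmatrix}$.

The likelihood ratio criterion for testing $\Sigma_{56\cdot 4} = 0$ is

(46) $\left[\dfrac{|S_{(56)(56)\cdot 4}|}{|S_{55\cdot 4}|\cdot|S_{66\cdot 4}|}\right]^{n/2} = U_{56\cdot 4}^{n/2}$.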


The statistic 44,4 has the distribution of Ир, 5, | (p eps spo) since Sos. 
has the distribution W(X вос. 2 — Ра) independent of 54. 

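The estimator of $\Sigma$ under the three null hypotheses is (37) with $S_{(231)(231)\cdot 4}$ replaced by (42) and $S_{(56)(56)\cdot 4}$ replaced by (45). The determinant of this matrix is

(47) $|\hat{\Sigma}_{\omega}| = |S_{22\cdot 14}|\cdot|S_{33\cdot 14}|\cdot|S_{11\cdot 4}|\cdot|S_{55\cdot 4}|\cdot|S_{66\cdot 4}|\cdot|S_{44}|$.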


The likelihood ratio criterion for testing the three null hypotheses is

(48) $\dfrac{|\hat{\Sigma}_{\Omega}|^{n/2}}{|\hat{\Sigma}_{\omega}|^{n/2}} = \left(U_{(231)(56)\cdot 4}\,U_{23\cdot 14}\,U_{56\cdot 4}\right)^{n/2}$.

When the null hypotheses are true, the factors $U_{(231)(56)\cdot 4}$, $U_{23\cdot 14}$, and $U_{56\cdot 4}$ are independent. Their distributions are discussed in Sections 8.4 and 8.5. In particular, the moments of these factors are given and asymptotic expansions of distributions are described.


15.5.4. Directed Graphs 


We suppose that the vertices are well-numbered, $1, \ldots, n$; the $N$ observations $x_1, \ldots, x_N$ are made on $X = (X_1, \ldots, X_n)'$. The model is (22) to (25) or (26) of Section 15.3. Let $\bar{x} = N^{-1}\sum_{\alpha=1}^{N} x_\alpha$ and $S = (N-1)^{-1}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$. The model (26) consists of $x_1 = \alpha_1 + \varepsilon_1$ and $n - 1$ regressions




(23) to (25). The vector a, in x, =а + є, is estimated by ¥,. If pale.) is 
vacuous, a is estimated by X,; if pa(v.) is not vacuous and X 


pauo = Ха. then 
В, and a, are estimated by 


. N N -1 
(49) В, = E (40) (rai) E (emira)! , 


In general 
^ N ГА 
(51) В; = » [xi - Xj] [хохо о] 
an 
N -1 
_ _ А 
. L [ходо — Я рысь, | [х,ое — Xo] | , 
ac 
(52) Oj HX; + ВХ ау 


Conditional on Xpa( ux a the distribution of these estimators is normal. 


15.5.5. Chain Graphs 


The condition (C1) of Section 15.4 specifies that X, 1 ХХ а a7 for ue V(r) 
and for v € V(o), where o € nd (7) N ра „(т); that is, the past earlier than 
ра (Tr) is independent of the present. This condition corresponds to the 


Markov property in time series analysis. Thus X, is in terms of deviations 


from the regression of X, on Xn utr) 


EX lX ap) =t, +В, X apa) 


The vector œ, and the matrix B, are estimated as for directed graphs. 

The Markov property (C2) indıcates the analysis in terms of deviations 
X,—o,— B, X apa) The estimation of the structure of dependence within 
V(r) is carried out as in Section 15.5.2. 

The Markov property (C3*) specifies X, iL ХХ аи) for ueU cV) 
and v € pa (И) U nb, (U). The property is a restriction on the regression of 
X. on X 


pa g(r)* 
APPENDIX A


Matrix Theory 


A.1. DEFINITION OF A MATRIX AND OPERATIONS
ON MATRICES


In this appendix we summarize some of the well-known definitions and 
theorems of matrix algebra. A number of results that are not always con- 
tained in books on matrix algebra are proved here. 

An $m \times n$ matrix $A$ is a rectangular array of real numbers

(1) $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$,

which may be abbreviated $(a_{ij})$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. Capital boldface letters will be used to denote matrices whose elements are the corresponding lowercase letters with appropriate subscripts. The sum of two matrices $A$ and $B$ of the same numbers of rows and columns, respectively, is defined by

(2) $A + B = (a_{ij}) + (b_{ij}) = (a_{ij} + b_{ij})$.

The product of a matrix by a real number $\lambda$ is defined by

(3) $\lambda A = A\lambda = (\lambda a_{ij})$.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
These operations have the algebraic properties

(4) $A + B = B + A$,
(5) $(A + B) + C = A + (B + C)$,
(6) $A + (-1)A = (0)$,
(7) $(\lambda + \mu)A = \lambda A + \mu A$,
(8) $\lambda(A + B) = \lambda A + \lambda B$,
(9) $\lambda(\mu A) = (\lambda\mu)A$.

The matrix $(0)$ with all elements 0 is denoted as $0$. The operation $A + (-1)B$ is denoted as $A - B$.

If $A$ has the same number of columns as $B$ has rows, that is, $A = (a_{ij})$, $i = 1, \ldots, l$, $j = 1, \ldots, m$, $B = (b_{jk})$, $j = 1, \ldots, m$, $k = 1, \ldots, n$, then $A$ and $B$ can be multiplied according to the rule

(10) $AB = (a_{ij})(b_{jk}) = \left(\sum_{j=1}^{m} a_{ij}b_{jk}\right)$, $i = 1, \ldots, l$, $k = 1, \ldots, n$;

that is, $AB$ is a matrix with $l$ rows and $n$ columns, the element in the $i$th row and $k$th column being $\sum_{j=1}^{m} a_{ij}b_{jk}$. The matrix product has the properties

(11) $(AB)C = A(BC)$,
(12) $A(B + C) = AB + AC$,
(13) $(A + B)C = AC + BC$.


The relationships (11)-(13) hold provided one side is meaningful (i.e., the 
numbers of rows and columns are such that the operations can be performed); 
it follows then that the other side is also meaningful. Because of (11) we can 
write 


(14) $(AB)C = A(BC) = ABC$.


The product BA may be meaningless even if AB is meaningful, and even 
when both are meaningful they are not necessarily equal. 
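The multiplication rule (10), the associative law (11), and the remark that $AB$ and $BA$ need not agree can be checked on a small case. The following is a minimal sketch in plain Python; the helper name `matmul` and the sample matrices are illustrative assumptions, not part of the text.

```python
def matmul(a, b):
    """Product of an l x m matrix a and an m x n matrix b, following rule (10)."""
    if len(a[0]) != len(b):
        raise ValueError("a must have as many columns as b has rows")
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))]
            for i in range(len(a))]


# Illustrative matrices (assumed for this sketch).
A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]             # 3 x 2
C = [[1, 1],
     [0, 2]]             # 2 x 2

AB = matmul(A, B)        # 2 x 2
BA = matmul(B, A)        # 3 x 3, so AB and BA cannot be equal

# Associative law (11): (AB)C = A(BC).
left = matmul(AB, C)
right = matmul(A, matmul(B, C))
```

Here both products exist but have different sizes, which is the simplest way for $AB \neq BA$ to occur.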

The transpose of the $l \times m$ matrix $A = (a_{ij})$ is defined to be the $m \times l$ matrix $A'$ which has in the $j$th row and $i$th column the element that $A$ has in the $i$th row and $j$th column. The operation of transposition has the properties

(15) $(A')' = A$,
(16) $(A + B)' = A' + B'$,
(17) $(AB)' = B'A'$,

again with the restriction (which is understood throughout this book) that at least one side is meaningful.

A vector $x$ with $m$ components can be treated as a matrix with $m$ rows and one column. Therefore, the above operations hold for vectors.

We shall now be concerned with square matrices of the same size, which can be added and multiplied at will. The number of rows and columns will be taken to be $p$. $A$ is called symmetric if $A = A'$. A particular matrix of considerable interest is the identity matrix

(18) $I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = (\delta_{ij})$,

where $\delta_{ij}$, the Kronecker delta, is defined by

(19) $\delta_{ij} = 1$, $i = j$; $\delta_{ij} = 0$, $i \neq j$.

The identity matrix satisfies

(20) $IA = AI = A$.


We shall write the identity as $I_p$ when we wish to emphasize that it is of order $p$. Associated with any square matrix $A$ is the determinant $|A|$, defined by

(21) $|A| = \sum (-1)^{f(j_1, \ldots, j_p)} \prod_{i=1}^{p} a_{ij_i}$,

where the summation is taken over all permutations $(j_1, \ldots, j_p)$ of the set of integers $(1, \ldots, p)$, and $f(j_1, \ldots, j_p)$ is the number of transpositions required to change $(1, \ldots, p)$ into $(j_1, \ldots, j_p)$. A transposition consists of interchanging two numbers, and it can be shown that, although one can transform $(1, \ldots, p)$ into $(j_1, \ldots, j_p)$ by transpositions in many different ways, the number of transpositions required is always even or always odd, so that $(-1)^{f(j_1, \ldots, j_p)}$ is consistently defined. Then

(22) $|AB| = |A| \cdot |B|$.

Also

(23) $|A| = |A'|$.
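The permutation definition of the determinant can be evaluated directly for small matrices. The sketch below is illustrative only: it obtains the sign of a permutation from its number of inversions, which has the same parity as the number of transpositions, and the function names and sample matrices are assumptions made for the example.

```python
from itertools import permutations


def sign(perm):
    """+1 or -1 according to the parity of the permutation.

    The inversion count has the same parity as the number of
    transpositions needed to reach perm from the identity.
    """
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm))
                     if perm[i] > perm[j])
    return -1 if inversions % 2 else 1


def det(a):
    """Determinant as a signed sum over all permutations of the columns."""
    p = len(a)
    total = 0
    for perm in permutations(range(p)):
        term = sign(perm)
        for i in range(p):
            term *= a[i][perm[i]]
        total += term
    return total


def matmul(a, b):
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


# Illustrative matrices (assumed for this sketch).
A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]
B = [[1, 2, 0],
     [0, 1, 0],
     [3, 0, 2]]

det_A = det(A)
det_B = det(B)
det_AB = det(matmul(A, B))      # product rule: equals det_A * det_B
det_At = det(transpose(A))      # transpose rule: equals det_A
```

The sum has $p!$ terms, so this form is useful for checking identities on small cases rather than for computation.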


A submatrix of $A$ is a rectangular array obtained from $A$ by deleting rows and columns. A minor is the determinant of a square submatrix of $A$. The minor of an element $a_{ij}$ is the determinant of the submatrix of a square matrix $A$ obtained by deleting the $i$th row and $j$th column. The cofactor of $a_{ij}$, say $A_{ij}$, is $(-1)^{i+j}$ times the minor of $a_{ij}$. It follows from (21) that

(24) $|A| = \sum_{j=1}^{p} a_{ij}A_{ij} = \sum_{j=1}^{p} a_{jk}A_{jk}$.


If $|A| \neq 0$, there exists a unique matrix $B$ such that $AB = I$. Then $B$ is called the inverse of $A$ and is denoted by $A^{-1}$. Let $a^{hk}$ be the element of $A^{-1}$ in the $h$th row and $k$th column. Then

(25) $a^{hk} = \dfrac{A_{kh}}{|A|}$.
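The cofactor formula for the inverse, in which the element of $A^{-1}$ in row $h$ and column $k$ is the cofactor of $a_{kh}$ divided by $|A|$, can be sketched as follows. The example uses exact rational arithmetic and assumes integer entries; the function names and the sample matrix are illustrative.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix of a with row i and column j deleted."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j))
               for j in range(len(a)))


def cofactor(a, i, j):
    """(-1)^(i+j) times the minor of the element in row i, column j."""
    return (-1) ** (i + j) * det(minor(a, i, j))


def inverse(a):
    """Inverse of an integer matrix; element (h, k) is cofactor(k, h) / |A|."""
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    p = len(a)
    return [[Fraction(cofactor(a, k, h), d) for k in range(p)]
            for h in range(p)]


def matmul(a, b):
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


# Illustrative matrix (assumed for this sketch); its determinant is 18.
A = [[2, 1, 0],
     [1, 3, 1],
     [0, 1, 4]]
A_inv = inverse(A)
product = matmul(A, A_inv)      # should be the 3 x 3 identity
```

Note the interchange of indices: the cofactors are transposed before division by the determinant.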


The operation of taking the inverse satisfies

(26) $(AC)^{-1} = C^{-1}A^{-1}$,

since

(27) $(AC)(C^{-1}A^{-1}) = A(CC^{-1})A^{-1} = AIA^{-1} = AA^{-1} = I$.

Also $I^{-1} = I$ and $A^{-1}A = I$. Furthermore, since the transposition of (27) gives $(A^{-1})'A' = I$, we have $(A^{-1})' = (A')^{-1}$.

A matrix whose determinant is not zero is called nonsingular. If $|A| \neq 0$, then the only solution to

(28) $Az = 0$

is the trivial one $z = 0$ [by multiplication of (28) on the left by $A^{-1}$]. If $|A| = 0$, there is at least one nontrivial solution (that is, $z \neq 0$). Thus an equivalent definition of $A$ being nonsingular is that (28) have only the trivial solution.

A set of vectors $z_1, \ldots, z_r$ is said to be linearly independent if there exists no set of scalars $c_1, \ldots, c_r$, not all zero, such that $\sum_{i=1}^{r} c_i z_i = 0$. A $q \times p$ matrix $D$ is said to be of rank $r$ if the maximum number of linearly independent columns is $r$. Then every minor of order $r + 1$ must be zero (from the remarks in the preceding paragraph applied to the relevant square matrix of order $r + 1$), and at least one minor of order $r$ must be nonzero. Conversely, if there is at least one minor of order $r$ that is nonzero, there is at least one set of $r$ columns (or rows) which is linearly independent. If all minors of order $r + 1$ are zero, there cannot be any set of $r + 1$ columns (or rows) that are linearly independent, for such linear independence would imply a nonzero minor of order $r + 1$, but this contradicts the assumption. Thus rank $r$ is equivalently defined by the maximum number of linearly independent rows, by the maximum number of linearly independent columns, or by the maximum order of nonzero minors.
We now consider the quadratic form

(29) $x'Ax = \sum_{i,j=1}^{p} a_{ij}x_i x_j$,

where $x' = (x_1, \ldots, x_p)$ and $A = (a_{ij})$ is a symmetric matrix. This matrix $A$ and the quadratic form are called positive semidefinite if $x'Ax \geq 0$ for all $x$. If $x'Ax > 0$ for all $x \neq 0$, then $A$ and the quadratic form are called positive definite. In this book positive definite implies the matrix is symmetric.


Theorem A.1.1. If $C$ with $p$ rows and columns is positive definite, and if $B$ with $p$ rows and $q$ columns, $q \leq p$, is of rank $q$, then $B'CB$ is positive definite.

Proof. Given a vector $y \neq 0$, let $x = By$. Since $B$ is of rank $q$, $By = x \neq 0$. Then

(30) $y'(B'CB)y = (By)'C(By) = x'Cx > 0$.

The proof is completed by observing that $B'CB$ is symmetric. As a converse, we observe that $B'CB$ is positive definite only if $B$ is of rank $q$, for otherwise there exists $y \neq 0$ such that $By = 0$. ∎


Corollary A.1.1. If $C$ is positive definite and $B$ is nonsingular, then $B'CB$ is positive definite.

Corollary A.1.2. If $C$ is positive definite, then $C^{-1}$ is positive definite.

Proof. $C$ must be nonsingular; for if $Cx = 0$ for $x \neq 0$, then $x'Cx = 0$ for this $x$, but that is contrary to the assumption that $C$ is positive definite. Let $B$ in Theorem A.1.1 be $C^{-1}$. Then $B'CB = (C^{-1})'CC^{-1} = (C^{-1})'$. Transposing $CC^{-1} = I$, we have $(C^{-1})'C' = (C^{-1})'C = I$. Thus $C^{-1} = (C^{-1})'$. ∎


Corollary A.1.3. The $q \times q$ matrix formed by deleting $p - q$ rows of a positive definite matrix $C$ and the corresponding $p - q$ columns of $C$ is positive definite.

Proof. This follows from Theorem A.1.1 by forming $B$ by taking the $p \times p$ identity matrix and deleting the columns corresponding to those deleted from $C$. ∎


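The trace of a square matrix $A$ is defined as $\operatorname{tr} A = \sum_{i=1}^{p} a_{ii}$. The following properties are verified directly:

(31) $\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$,

(32) $\operatorname{tr} AB = \operatorname{tr} BA$.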
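The identity $\operatorname{tr} AB = \operatorname{tr} BA$ holds even when $AB$ and $BA$ have different orders. A brief sketch, with helper names and matrices chosen only for illustration:

```python
def matmul(a, b):
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    """Sum of the diagonal elements of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


# Illustrative matrices (assumed for this sketch).
A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]             # 3 x 2

tr_AB = trace(matmul(A, B))   # AB is 2 x 2
tr_BA = trace(matmul(B, A))   # BA is 3 x 3
```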


A square matrix $A$ is said to be diagonal if $a_{ij} = 0$, $i \neq j$. Then $|A| = \prod_{i=1}^{p} a_{ii}$, for in (24) $|A| = a_{11}A_{11}$, and in turn $A_{11}$ is evaluated similarly.

A square matrix $A$ is said to be triangular if $a_{ij} = 0$ for $i > j$ or alternatively for $i < j$. If $a_{ij} = 0$ for $i > j$, the matrix is upper triangular, and, if $a_{ij} = 0$ for $i < j$, it is lower triangular. The product of two upper triangular matrices $A$, $B$ is upper triangular, for the $i,j$th term ($i > j$) of $AB$ is $\sum_{k=1}^{p} a_{ik}b_{kj} = 0$ since $a_{ik} = 0$ for $k < i$ and $b_{kj} = 0$ for $k > j$. Similarly, the product of two lower triangular matrices is lower triangular. The determinant of a triangular matrix is the product of the diagonal elements. The inverse of a nonsingular triangular matrix is triangular in the same way.


Theorem A.1.2. If $A$ is nonsingular, there exists a nonsingular lower triangular matrix $F$ such that $FA = A^*$ is nonsingular upper triangular.

Proof. Let $A = A_1$. Define recursively $A_g = (a_{ij}^{(g)}) = F_{g-1}A_{g-1}$, $g = 2, \ldots, p$, where $F_{g-1} = (f_{ij}^{(g-1)})$ has elements

(33) $f_{jj}^{(g-1)} = 1$, $j = 1, \ldots, p$,
(34) $f_{i,g-1}^{(g-1)} = -\dfrac{a_{i,g-1}^{(g-1)}}{a_{g-1,g-1}^{(g-1)}}$, $i = g, \ldots, p$,
(35) $f_{ij}^{(g-1)} = 0$, otherwise.

Then

(36) $a_{ij}^{(g)} = 0$, $i = j + 1, \ldots, p$, $j = 1, \ldots, g - 1$,
(37) $a_{ij}^{(g)} = a_{ij}^{(g-1)}$, $i = 1, \ldots, g - 1$, $j = 1, \ldots, p$,
(38) $a_{ij}^{(g)} = a_{ij}^{(g-1)} + f_{i,g-1}^{(g-1)}a_{g-1,j}^{(g-1)} = a_{ij}^{(g-1)} - \dfrac{a_{i,g-1}^{(g-1)}a_{g-1,j}^{(g-1)}}{a_{g-1,g-1}^{(g-1)}}$, $i, j = g, \ldots, p$.

Note that $F = F_{p-1} \cdots F_1$ is lower triangular and the elements of $A_g$ in the first $g - 1$ columns below the diagonal are 0; in particular $A^* = FA$ is upper triangular. From $|A| \neq 0$ and $|F_{g-1}| = 1$, we have $|A_{g-1}| \neq 0$. Hence $a_{11}^{(1)}, \ldots, a_{g-2,g-2}^{(g-2)}$ are different from 0 and the last $p - g$ columns of $A_{g-1}$ can be numbered so $a_{g-1,g-1}^{(g-1)} \neq 0$; then $f_{i,g-1}^{(g-1)}$ is well defined. ∎

The equation $FA = A^*$ can be solved to obtain $A = LR$, where $R = A^*$ is upper triangular and $L = F^{-1}$ is lower triangular and has 1's on the main diagonal (because $F$ is lower triangular and has 1's on the main diagonal). This is known as the LR decomposition.
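The elimination in the proof of Theorem A.1.2 can be sketched as a short routine that returns $L$ and $R$ directly. This is an illustration under a simplifying assumption: every pivot encountered is nonzero, so no renumbering of columns is attempted. The routine stores the elimination multipliers in $L$ rather than forming $F$ and inverting it; the function names and the sample matrix are assumptions for the example.

```python
from fractions import Fraction


def lr_decompose(a):
    """Return (L, R) with A = L R, L unit lower triangular, R upper triangular.

    Assumes every pivot met during elimination is nonzero.
    """
    p = len(a)
    r = [[Fraction(x) for x in row] for row in a]
    l = [[Fraction(int(i == j)) for j in range(p)] for i in range(p)]
    for g in range(p - 1):
        if r[g][g] == 0:
            raise ZeroDivisionError("zero pivot; renumbering would be needed")
        for i in range(g + 1, p):
            m = r[i][g] / r[g][g]        # multiplier that clears r[i][g]
            l[i][g] = m
            for j in range(g, p):
                r[i][j] -= m * r[g][j]
    return l, r


def matmul(a, b):
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


# Illustrative matrix (assumed for this sketch).
A = [[4, 2, 2],
     [2, 5, 3],
     [2, 3, 6]]
L, R = lr_decompose(A)
```

Exact fractions are used so that the product $LR$ reproduces $A$ without rounding.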


Corollary A.1.4. If $A$ is positive definite, there exists a lower triangular nonsingular matrix $F$ such that $FAF'$ is diagonal and positive definite.

Proof. From Theorem A.1.2, there exists a lower triangular nonsingular matrix $F$ such that $FA$ is upper triangular and nonsingular. Then $FAF'$ is upper triangular and symmetric; hence it is diagonal. ∎


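Corollary A.1.5. The determinant of a positive definite matrix $A$ is positive.

Proof. From the construction of $FAF'$,

(39) $FAF' = \begin{pmatrix} a_{11}^{(1)} & 0 & \cdots & 0 \\ 0 & a_{22}^{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_{pp}^{(p)} \end{pmatrix}$

is positive definite, and hence $a_{gg}^{(g)} > 0$, $g = 1, \ldots, p$, and $0 < |FAF'| = |F| \cdot |A| \cdot |F'| = |A|$. ∎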
Corollary A.1.6. If $A$ is positive definite, there exists a lower triangular matrix $G$ such that $GAG' = I$.

Proof. Let $FAF' = D^2$, and let $D$ be the diagonal matrix whose diagonal elements are the positive square roots of the diagonal elements of $D^2$. Then $G = D^{-1}F$ serves the purpose. ∎

Corollary A.1.7 (Cholesky Decomposition). If $A$ is positive definite, there exists a unique lower triangular matrix $T$ ($t_{ij} = 0$, $i < j$) with positive diagonal elements such that $A = TT'$.

Proof. From Corollary A.1.6, $A = G^{-1}(G')^{-1}$, where $G$ is lower triangular. Then $T = G^{-1}$ is lower triangular. ∎

In effect this theorem was proved in Section 7.2 for $A = VV'$.
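The factor $T$ in $A = TT'$ can be computed row by row. The routine below is a standard textbook scheme offered as a sketch, not the construction used in the proofs above; the function name and the sample matrix are assumptions for the example.

```python
import math


def cholesky(a):
    """Lower triangular T with positive diagonal such that A = T T'.

    Raises ValueError when a nonpositive pivot shows that the symmetric
    input is not positive definite.
    """
    p = len(a)
    t = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(t[i][k] * t[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                t[i][i] = math.sqrt(d)
            else:
                t[i][j] = (a[i][j] - s) / t[j][j]
    return t


# Illustrative positive definite matrix (assumed for this sketch).
A = [[4, 2, 2],
     [2, 5, 3],
     [2, 3, 6]]
T = cholesky(A)
```

For this matrix every square root is exact, so $T$ has the rows $(2, 0, 0)$, $(1, 2, 0)$, $(1, 1, 2)$.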


А.2. CHARACTERISTIC ROOTS AND VECTORS 


The characteristic roots of a square matrix $B$ are defined as the roots of the characteristic equation

(1) $|B - \lambda I| = 0$.

Alternative terms are latent roots and eigenvalues. For example, with

$B = \begin{pmatrix} 5 & 2 \\ 2 & 5 \end{pmatrix}$,

we have

(2) $|B - \lambda I| = \begin{vmatrix} 5 - \lambda & 2 \\ 2 & 5 - \lambda \end{vmatrix} = 25 - 4 - 10\lambda + \lambda^2 = \lambda^2 - 10\lambda + 21$.

The degree of the polynomial equation (1) is the order of the matrix $B$ and the constant term is $|B|$.
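For a $2 \times 2$ matrix the characteristic polynomial and its roots can be written down explicitly. The sketch below takes the example matrix to have rows $(5, 2)$ and $(2, 5)$, the reading consistent with the polynomial $\lambda^2 - 10\lambda + 21$; the function names are illustrative assumptions.

```python
import math


def characteristic_polynomial_2x2(b):
    """Coefficients (c0, c1, c2) of |B - lam*I| = c0 + c1*lam + c2*lam**2."""
    (b11, b12), (b21, b22) = b
    return (b11 * b22 - b12 * b21, -(b11 + b22), 1)


def characteristic_roots_2x2(b):
    """Real roots of the characteristic equation, larger root first."""
    c0, c1, c2 = characteristic_polynomial_2x2(b)
    disc = c1 * c1 - 4 * c2 * c0
    if disc < 0:
        raise ValueError("complex roots are not handled in this sketch")
    r = math.sqrt(disc)
    return ((-c1 + r) / (2 * c2), (-c1 - r) / (2 * c2))


B = [[5, 2],
     [2, 5]]
coeffs = characteristic_polynomial_2x2(B)   # constant term is |B| = 21
roots = characteristic_roots_2x2(B)
```

The roots are 7 and 3; their product is the constant term $|B| = 21$ and their sum is the trace, 10.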

A matrix $C$ is said to be orthogonal if $C'C = I$; it follows that $CC' = I$. Let the vectors $x' = (x_1, \ldots, x_p)$ and $y' = (y_1, \ldots, y_p)$ represent two points in a $p$-dimensional Euclidean space. The distance squared between them is $D(x, y) = (x - y)'(x - y)$. The transformation $z = Cx$ can be thought of as a change of coordinate axes in the $p$-dimensional space. If $C$ is orthogonal, the transformation is distance-preserving, for

(3) $D(Cx, Cy) = (Cy - Cx)'(Cy - Cx) = (y - x)'C'C(y - x) = (y - x)'(y - x) = D(x, y)$.

Since the angles of a triangle are determined by the lengths of its sides, the transformation $z = Cx$ also preserves angles. It consists of a rotation together with a possible reflection of one or more axes. We shall denote $\sqrt{x'x}$ by $\|x\|$.
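The distance-preserving property of an orthogonal transformation can be seen on a rotation with rational entries, so that the arithmetic is exact. The matrix, the points, and the helper names below are assumptions chosen for illustration.

```python
from fractions import Fraction


def matvec(c, x):
    """Product of the matrix c with the vector x."""
    return [sum(cij * xj for cij, xj in zip(row, x)) for row in c]


def dist2(x, y):
    """Squared distance D(x, y) = (x - y)'(x - y)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


# A rotation built from the 3-4-5 triangle; its columns are orthonormal.
C = [[Fraction(3, 5), Fraction(-4, 5)],
     [Fraction(4, 5), Fraction(3, 5)]]

x = [1, 2]
y = [4, -2]

before = dist2(x, y)
after = dist2(matvec(C, x), matvec(C, y))
```

Both squared distances equal 25, although the individual points move.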


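Theorem A.2.1. Given any symmetric matrix $B$, there exists an orthogonal matrix $C$ such that

(4) $C'BC = D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p \end{pmatrix}$.

If $B$ is positive semidefinite, then $d_i \geq 0$, $i = 1, \ldots, p$; if $B$ is positive definite, then $d_i > 0$, $i = 1, \ldots, p$.

The proof is given in the discussion of principal components in Section 11.2 for the case of $B$ positive semidefinite and holds for $B$ symmetric. The characteristic equation (1) under transformation by $C$ becomes

(5) $0 = |C'| \cdot |B - \lambda I| \cdot |C| = |C'(B - \lambda I)C| = |C'BC - \lambda I| = |D - \lambda I| = \begin{vmatrix} d_1 - \lambda & 0 & \cdots & 0 \\ 0 & d_2 - \lambda & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p - \lambda \end{vmatrix} = \prod_{i=1}^{p}(d_i - \lambda)$.

Thus the characteristic roots of $B$ are the diagonal elements of the transformed matrix $D$.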

If $\lambda_i$ is a characteristic root of $B$, then a vector $x_i$ not identically 0 satisfying

(6) $(B - \lambda_i I)x_i = 0$

is called a characteristic vector (or eigenvector) of the matrix $B$ corresponding to the characteristic root $\lambda_i$. Any scalar multiple of $x_i$ is also a characteristic vector. When $B$ is symmetric, $x_i'(B - \lambda_i I) = 0$. If the roots are distinct, $x_j'Bx_i = 0$ and $x_j'x_i = 0$, $i \neq j$. Let $c_i = (1/\|x_i\|)x_i$ be the $i$th normalized characteristic vector, and let $C = (c_1, \ldots, c_p)$. Then $C'C = I$ and $BC = CD$. These lead to (4). If a characteristic root has multiplicity $m$, then a set of $m$ corresponding characteristic vectors can be replaced by $m$ linearly independent linear combinations of them. The vectors can be chosen to satisfy (6) and $x_j'x_i = 0$ and $x_j'Bx_i = 0$, $i \neq j$.

A characteristic vector lies in the direction of the principal axis (see Chapter 11). The characteristic roots of $B$ are proportional to the squares of the reciprocals of the lengths of the principal axes of the ellipsoid

(7) $x'Bx = 1$,

since this becomes under the rotation $y = C'x$

(8) $1 = y'Dy = \sum_{i=1}^{p} d_i y_i^2$.
For a pair of matrices $A$ (nonsingular) and $B$ we shall also consider equations of the form

(9) $|B - \lambda A| = 0$.

The roots of such equations are of interest because of their invariance under certain transformations. In fact, for nonsingular $C$, the roots of

(10) $|C'BC - \lambda(C'AC)| = 0$

are the same as those of (9) since

(11) $|C'BC - \lambda C'AC| = |C'(B - \lambda A)C| = |C'| \cdot |B - \lambda A| \cdot |C|$

and $|C'| = |C| \neq 0$.

By Corollary A.1.6 we have that if $A$ is positive definite there is a matrix $E$ such that $E'AE = I$. Let $E'BE = B^*$. From Theorem A.2.1 we deduce that there exists an orthogonal matrix $C$ such that $C'B^*C = D$, where $D$ is diagonal. Defining $EC$ as $F$, we have the following theorem:


Theorem A.2.2. Given $B$ positive semidefinite and $A$ positive definite, there exists a nonsingular matrix $F$ such that

(12) $F'BF = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}$,

(13) $F'AF = I$,

where $\lambda_1 \geq \cdots \geq \lambda_p$ ($\geq 0$) are the roots of (9). If $B$ is positive definite, $\lambda_i > 0$, $i = 1, \ldots, p$.

Corresponding to each root $\lambda_i$ there is a vector $x_i$ satisfying

(14) $(B - \lambda_i A)x_i = 0$

and $x_i'Ax_i = 1$. If the roots are distinct, $x_j'Bx_i = 0$ and $x_j'Ax_i = 0$, $i \neq j$. Then $F = (x_1, \ldots, x_p)$. If a root has multiplicity $m$, then a set of $m$ linearly independent $x_i$'s can be replaced by $m$ linearly independent combinations of them. The vectors can be chosen to satisfy (14) and $x_j'Bx_i = 0$ and $x_j'Ax_i = 0$, $i \neq j$.


Theorem A.2.3 (The Singular Value Decomposition). Given an $n \times p$ matrix $X$, $n \geq p$, there exists an $n \times n$ orthogonal matrix $P$, a $p \times p$ orthogonal matrix $Q$, and an $n \times p$ matrix $D$ consisting of a $p \times p$ diagonal positive semidefinite matrix and an $(n - p) \times p$ zero matrix such that

(15) $X = PDQ$.

Proof. From Theorem A.2.1, there exists a $p \times p$ orthogonal matrix $Q$ and a diagonal matrix $E$ such that

(16) $QX'XQ' = E = \begin{pmatrix} E_1 & 0 \\ 0 & 0 \end{pmatrix}$,

where $E_1$ is diagonal and positive definite. Let $XQ' = Y = (Y_1\ Y_2)$, where the number of columns of $Y_1$ is the order of $E_1$. Then $Y_2'Y_2 = 0$, and hence $Y_2 = 0$. Let $P_1 = Y_1E_1^{-1/2}$. Then $P_1'P_1 = I$. An $n \times n$ orthogonal matrix $P = (P_1\ P_2)$ satisfying the theorem is obtained by adjoining $P_2$ to make $P$ orthogonal. Then the upper left-hand corner of $D$ is $E_1^{1/2}$, and the rest of $D$ consists of zeros. ∎


Theorem A.2.4. Let $A$ be positive definite and $B$ be positive semidefinite. Then

(17) $\lambda_p \leq \dfrac{x'Bx}{x'x} \leq \lambda_1$,

where $\lambda_1$ and $\lambda_p$ are the largest and smallest roots of (1), and

(18) $\lambda_p \leq \dfrac{x'Bx}{x'Ax} \leq \lambda_1$,

where $\lambda_1$ and $\lambda_p$ are the largest and smallest roots of (9).

Proof. The inequalities (17) were essentially proved in Section 11.2, and can also be derived from (4). The inequalities (18) follow from Theorem A.2.2. ∎
A square matrix $A$ is idempotent if $A^2 = A$. If $\lambda$ satisfies $|A - \lambda I| = 0$, there exists a vector $x \neq 0$ such that $Ax = \lambda x = A^2x$. However, $A^2x = A(Ax) = \lambda Ax = \lambda^2x$. Thus $\lambda^2 = \lambda$, and $\lambda$ is either 0 or 1. The multiplicity of $\lambda = 1$ is the rank of $A$. If $A$ is $p \times p$, then $I_p - A$ is idempotent of rank $p - (\text{rank } A)$, and $A$ and $I_p - A$ are orthogonal. If $A$ is symmetric, there is an orthogonal matrix $O$ such that

(19) $OAO' = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}$, $\quad O(I - A)O' = \begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix}$.


А.3. PARTITIONED VECTORS AND MATRICES 


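Consider the matrix $A$ defined by (1) of Section A.1. Let

(1) $A_{11} = (a_{ij})$, $i = 1, \ldots, p$, $j = 1, \ldots, q$,
    $A_{12} = (a_{ij})$, $i = 1, \ldots, p$, $j = q + 1, \ldots, n$,
    $A_{21} = (a_{ij})$, $i = p + 1, \ldots, m$, $j = 1, \ldots, q$,
    $A_{22} = (a_{ij})$, $i = p + 1, \ldots, m$, $j = q + 1, \ldots, n$.

Then we can write

(2) $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$.

We say that $A$ has been partitioned into submatrices $A_{ij}$. Let $B$ ($m \times n$) be partitioned similarly into submatrices $B_{ij}$, $i, j = 1, 2$. Then

(3) $A + B = \begin{pmatrix} A_{11} + B_{11} & A_{12} + B_{12} \\ A_{21} + B_{21} & A_{22} + B_{22} \end{pmatrix}$.

Now partition $C$ ($n \times r$) as

(4) $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$,

where $C_{11}$ and $C_{12}$ have $q$ rows and $C_{11}$ and $C_{21}$ have $s$ columns. Then

(5) $AC = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11}C_{11} + A_{12}C_{21} & A_{11}C_{12} + A_{12}C_{22} \\ A_{21}C_{11} + A_{22}C_{21} & A_{21}C_{12} + A_{22}C_{22} \end{pmatrix}$.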
To verify this, consider an element in the first $p$ rows and first $s$ columns of $AC$. The $i,j$th element is

(6) $\sum_{k=1}^{n} a_{ik}c_{kj}$, $i \leq p$, $j \leq s$.

This sum can be written

(7) $\sum_{k=1}^{q} a_{ik}c_{kj} + \sum_{k=q+1}^{n} a_{ik}c_{kj}$.

The first sum is the $i,j$th element of $A_{11}C_{11}$, the second sum is the $i,j$th element of $A_{12}C_{21}$, and therefore the entire sum (6) is the $i,j$th element of $A_{11}C_{11} + A_{12}C_{21}$. In a similar fashion we can verify that the other submatrices of $AC$ can be written as in (5).

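We note in passing that if $A$ is partitioned as in (2), then the transpose of $A$ can be written

(8) $A' = \begin{pmatrix} A_{11}' & A_{21}' \\ A_{12}' & A_{22}' \end{pmatrix}$.

If $A_{12} = 0$ and $A_{21} = 0$, then for $A$ positive definite and $A_{11}$ square,

(9) $A^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix}$.

The matrix on the right exists because $A_{11}$ and $A_{22}$ are nonsingular. That the right-hand matrix is the inverse of $A$ is verified by multiplication:

(10) $\begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix}\begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix}$,

which is a partitioned form of $I$.

We also note that

(11) $\begin{vmatrix} A_{11} & 0 \\ 0 & A_{22} \end{vmatrix} = \begin{vmatrix} A_{11} & 0 \\ 0 & I \end{vmatrix} \cdot \begin{vmatrix} I & 0 \\ 0 & A_{22} \end{vmatrix} = |A_{11}| \cdot |A_{22}|$.

The evaluation of the first determinant in the middle is made by expanding according to minors of the last row; the only nonzero element in the sum is the last, which is 1 times a determinant of the same form with $I$ of order one less. The procedure is repeated until $|A_{11}|$ is the minor. Similarly,

(12) $\begin{vmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{vmatrix} = \begin{vmatrix} I & 0 \\ 0 & A_{22} \end{vmatrix} \cdot \begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}| \cdot |A_{22}|$.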


A useful fact is that if $A_1$ of $q$ rows and $p$ columns is of rank $q$, there exists a matrix $A_2$ of $p - q$ rows and $p$ columns such that

(13) $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$

is nonsingular. This statement is verified by numbering the columns of $A$ so that $A_{11}$, consisting of the first $q$ columns of $A_1$, is nonsingular (at least one $q \times q$ minor of $A_1$ is different from zero) and then taking $A_2$ as $(0\ I)$; then

(14) $|A| = \begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}|$,

which is not equal to zero.


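Theorem A.3.1. Let the square matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{22}$ is nonsingular, let

(15) $B = \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}$, $\quad C = \begin{pmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{pmatrix}$.

Then

(16) $BA = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ A_{21} & A_{22} \end{pmatrix}$, $\quad AC = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & A_{12} \\ 0 & A_{22} \end{pmatrix}$,

(17) $BAC = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ 0 & A_{22} \end{pmatrix}$.

If $A$ is symmetric, $C = B'$.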


Theorem A.3.2. Let the square matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{22}$ is nonsingular,

(18) $|A| = |A_{11} - A_{12}A_{22}^{-1}A_{21}| \cdot |A_{22}|$.

Proof. Equation (18) follows from (16) because $|B| = 1$. ∎
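The determinant identity $|A| = |A_{11} - A_{12}A_{22}^{-1}A_{21}| \cdot |A_{22}|$ can be checked numerically on a $4 \times 4$ matrix split into $2 \times 2$ blocks. The sketch uses exact fractions; the matrix, the block size, and the helper names are assumptions made for the example.

```python
from fractions import Fraction


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j]
               * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(len(a)))


def matmul(a, b):
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(m):
    """Inverse of a nonsingular 2 x 2 integer matrix."""
    d = det(m)
    return [[Fraction(m[1][1], d), Fraction(-m[0][1], d)],
            [Fraction(-m[1][0], d), Fraction(m[0][0], d)]]


# Illustrative symmetric matrix (assumed for this sketch).
A = [[4, 1, 2, 0],
     [1, 3, 0, 1],
     [2, 0, 5, 1],
     [0, 1, 1, 2]]
A11 = [row[:2] for row in A[:2]]
A12 = [row[2:] for row in A[:2]]
A21 = [row[:2] for row in A[2:]]
A22 = [row[2:] for row in A[2:]]

# A11 - A12 * A22^{-1} * A21
correction = matmul(matmul(A12, inverse_2x2(A22)), A21)
schur = [[A11[i][j] - correction[i][j] for j in range(2)] for i in range(2)]

lhs = det(A)
rhs = det(schur) * det(A22)
```

Both sides equal 55 for this matrix.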


Corollary A.3.1. For $C$ nonsingular

(19) $|C + yy'| = \begin{vmatrix} C & y \\ -y' & 1 \end{vmatrix} = |C|\,(1 + y'C^{-1}y)$.

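Theorem A.3.3. Let the nonsingular matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{22}$ is nonsingular, let $A_{11\cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. Then

(20) $A^{-1} = \begin{pmatrix} A_{11\cdot 2}^{-1} & -A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1} & A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1} \end{pmatrix}$.

Proof. From Theorem A.3.1,

(21) $A = B^{-1}\begin{pmatrix} A_{11\cdot 2} & 0 \\ 0 & A_{22} \end{pmatrix}C^{-1}$.

Hence

(22) $A^{-1} = C\begin{pmatrix} A_{11\cdot 2}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix}B = \begin{pmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{pmatrix}\begin{pmatrix} A_{11\cdot 2}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix}\begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}$.

Multiplication gives the desired result. ∎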


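Corollary A.3.2. If $x' = (x^{(1)\prime}, x^{(2)\prime})$, then

(23) $x'A^{-1}x = (x^{(1)} - A_{12}A_{22}^{-1}x^{(2)})'A_{11\cdot 2}^{-1}(x^{(1)} - A_{12}A_{22}^{-1}x^{(2)}) + x^{(2)\prime}A_{22}^{-1}x^{(2)}$.

Proof. From the theorem

(24) $x'A^{-1}x = x^{(1)\prime}A_{11\cdot 2}^{-1}x^{(1)} - x^{(1)\prime}A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1}x^{(2)} - x^{(2)\prime}A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1}x^{(1)} + x^{(2)\prime}(A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1})x^{(2)}$,

which is equal to the right-hand side of (23). ∎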
Theorem A.3.4. Let the nonsingular matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{11}$ and $A_{22}$ are nonsingular,

(25) $(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} = A_{22}^{-1}A_{21}(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1}$.

Proof. The lower right-hand corner of $A^{-1}$ is the right-hand side of (25) by Theorem A.3.3 and is also the left-hand side of (25) by interchange of 1 and 2. ∎


Theorem A.3.5. Let $U$ be $p \times m$. The conditions for $I_p - UU'$, $I_m - U'U$, and

(26) $\begin{pmatrix} I_p & U \\ U' & I_m \end{pmatrix}$

to be positive definite are the same.

Proof. We have

(27) $(v'\ w')\begin{pmatrix} I_p & U \\ U' & I_m \end{pmatrix}\begin{pmatrix} v \\ w \end{pmatrix} = v'v + v'Uw + w'U'v + w'w = v'(I_p - UU')v + (U'v + w)'(U'v + w)$.

The second term on the right-hand side is nonnegative; the first term is positive for all $v \neq 0$ if and only if $I_p - UU'$ is positive definite. Reversing the roles of $v$ and $w$ shows that (26) is positive definite if and only if $I_m - U'U$ is positive definite. ∎


A.4. SOME MISCELLANEOUS RESULTS 


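Theorem A.4.1. Let $C$ be $p \times p$, positive semidefinite, and of rank $r$ ($\leq p$). Then there is a nonsingular matrix $A$ such that

(1) $ACA' = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$.

Proof. Since $C$ is of rank $r$, there is a $(p - r) \times p$ matrix $A_2$ such that

(2) $A_2C = 0$.

Choose $B$ ($r \times p$) such that

(3) $\begin{pmatrix} B \\ A_2 \end{pmatrix}$

is nonsingular. Then

(4) $\begin{pmatrix} B \\ A_2 \end{pmatrix}C\,(B'\ A_2') = \begin{pmatrix} BCB' & BCA_2' \\ A_2CB' & A_2CA_2' \end{pmatrix} = \begin{pmatrix} BCB' & 0 \\ 0 & 0 \end{pmatrix}$.

This matrix is of rank $r$, and therefore $BCB'$ is nonsingular. By Corollary A.1.6 there is a nonsingular matrix $D$ such that $D(BCB')D' = I_r$. Then

(5) $A = \begin{pmatrix} DB \\ A_2 \end{pmatrix} = \begin{pmatrix} D & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} B \\ A_2 \end{pmatrix}$

is a nonsingular matrix such that (1) holds. ∎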


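Lemma A.4.1. If $E$ is $p \times p$, symmetric, and nonsingular, there is a nonsingular matrix $F$ such that

(6) $FEF' = \begin{pmatrix} I & 0 \\ 0 & -I \end{pmatrix}$,

where the order of $I$ is the number of positive characteristic roots of $E$ and the order of $-I$ is the number of negative characteristic roots of $E$.

Proof. From Theorem A.2.1 we know there is an orthogonal matrix $G$ such that

(7) $GEG' = \begin{pmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & h_p \end{pmatrix}$,

where $h_1 \geq \cdots \geq h_q > 0 > h_{q+1} \geq \cdots \geq h_p$ are the characteristic roots of $E$. Let

(8) $K = \operatorname{diag}\left(1/\sqrt{h_1}, \ldots, 1/\sqrt{h_q}, 1/\sqrt{-h_{q+1}}, \ldots, 1/\sqrt{-h_p}\right)$.

Then

(9) $KGEG'K' = (KG)E(KG)' = \begin{pmatrix} I & 0 \\ 0 & -I \end{pmatrix}$. ∎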


Corollary A.4.1. Let $C$ be $p \times p$, symmetric, and of rank $r$ ($\leq p$). Then there is a nonsingular matrix $A$ such that

(10) $ACA' = \begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix}$,

where the order of $I$ is the number of positive characteristic roots of $C$ and the order of $-I$ is the number of negative characteristic roots, the sum of the orders being $r$.

Proof. The proof is the same as that of Theorem A.4.1 except that Lemma A.4.1 is used instead of Corollary A.1.6. ∎

Lemma A.4.2. Let $A$ be $n \times m$ ($n > m$) such that

(11) $A'A = I_m$.

There exists an $n \times (n - m)$ matrix $B$ such that $(A\ B)$ is orthogonal.

Proof. Since $A$ is of rank $m$, there exists an $n \times (n - m)$ matrix $C$ such that $(A\ C)$ is nonsingular. Take $D$ as $C - AA'C$; then $D'A = 0$. Let $E$ $[(n - m) \times (n - m)]$ be such that $E'D'DE = I$. Then $B$ can be taken as $DE$. ∎

Lemma A.4.3. Let $x$ be a vector of $n$ components. Then there exists an orthogonal matrix $O$ such that

(12) $Ox = \begin{pmatrix} c \\ 0 \\ \vdots \\ 0 \end{pmatrix}$,

where $c = \sqrt{x'x}$.

Proof. Let the first row of $O$ be $(1/c)x'$. The other rows may be chosen in any way to make the matrix orthogonal. ∎

Lemma A.4.4. Let $B = (b_{ij})$ be a $p \times p$ matrix. Then

(13) $\dfrac{\partial |B|}{\partial b_{ij}} = B_{ij}$, $i, j = 1, \ldots, p$.
Proof. The expansion of $|B|$ by elements of the $i$th row is

(14) $|B| = \sum_{h=1}^{p} b_{ih}B_{ih}$.

Since $B_{ih}$ does not contain $b_{ij}$, the lemma follows. ∎


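Lemma A.4.5. Let $b_{ij} = \beta_{ij}(c_1, \ldots, c_n)$ be the $i,j$th element of a $p \times p$ matrix $B$. Then for $g = 1, \ldots, n$,

(15) $\dfrac{\partial |B|}{\partial c_g} = \sum_{i,j=1}^{p} \dfrac{\partial |B|}{\partial b_{ij}} \cdot \dfrac{\partial \beta_{ij}(c_1, \ldots, c_n)}{\partial c_g} = \sum_{i,j=1}^{p} B_{ij}\,\dfrac{\partial \beta_{ij}(c_1, \ldots, c_n)}{\partial c_g}$.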

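Theorem A.4.2. If $A = A'$,

(16) $\dfrac{\partial |A|}{\partial a_{ii}} = A_{ii}$,

(17) $\dfrac{\partial |A|}{\partial a_{ij}} = 2A_{ij}$, $i \neq j$.

Proof. Equation (16) follows from the expansion of $|A|$ according to elements of the $i$th row. To prove (17) let $b_{ij} = b_{ji} = a_{ij}$, $i, j = 1, \ldots, p$, $i \leq j$. Then by Lemma A.4.5,

(18) $\dfrac{\partial |B|}{\partial a_{ij}} = B_{ij} + B_{ji}$.

Since $|A| = |B|$ and $B_{ij} = B_{ji} = A_{ij} = A_{ji}$, (17) follows. ∎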
Theorem A.4.3.

(19) $\dfrac{\partial}{\partial x}(x'Ax) = 2Ax$,

where $\partial/\partial x$ denotes taking partial derivatives with respect to each component of $x$ and arranging the partial derivatives in a column.

Proof. Let $h$ be a column vector of as many components as $x$. Then

(20) $(x + h)'A(x + h) = x'Ax + h'Ax + x'Ah + h'Ah = x'Ax + 2h'Ax + h'Ah$.

The partial derivative vector is the vector multiplying $h'$ in the second term on the right. ∎
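The gradient formula $\partial(x'Ax)/\partial x = 2Ax$ for symmetric $A$ can be compared with central differences, which are exact for a quadratic form whatever the step size. The following sketch uses a small symmetric matrix; the names and the numbers are assumptions for illustration.

```python
def quad_form(a, x):
    """The quadratic form x'Ax."""
    n = len(x)
    return sum(a[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def gradient_formula(a, x):
    """The vector 2Ax; this is the gradient of x'Ax when a is symmetric."""
    n = len(x)
    return [2 * sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


def gradient_by_differences(a, x, h=1):
    """Central differences in each coordinate; exact for a quadratic form."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((quad_form(a, up) - quad_form(a, down)) / (2 * h))
    return grad


# Illustrative symmetric matrix and point (assumed for this sketch).
A = [[2, 1],
     [1, 3]]
x = [1, -2]

formula = gradient_formula(A, x)
numeric = gradient_by_differences(A, x)
```

The two computations agree because the term $h'Ah$ in the expansion cancels in the central difference.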


Definition A.4.1. Let $A = (a_{ij})$ be a $p \times m$ matrix and $B = (b_{\alpha\beta})$ be a $q \times n$ matrix. The $pq \times mn$ matrix with $a_{ij}b_{\alpha\beta}$ as the element in the $i,\alpha$th row and the $j,\beta$th column is called the Kronecker or direct product of $A$ and $B$ and is denoted by $A \otimes B$; that is,

(21) $A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & & \vdots \\ a_{p1}B & a_{p2}B & \cdots & a_{pm}B \end{pmatrix}$.

Some properties are the following when the orders of matrices permit the indicated operations:

(22) $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$,
(23) $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$.

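Theorem A.4.4. Let the $i$th characteristic root of $A$ ($p \times p$) be $\lambda_i$ and the corresponding characteristic vector be $x_i = (x_{1i}, \ldots, x_{pi})'$, and let the $\alpha$th root of $B$ ($q \times q$) be $\nu_\alpha$ and the corresponding characteristic vector be $y_\alpha$, $\alpha = 1, \ldots, q$. Then the $i,\alpha$th root of $A \otimes B$ is $\lambda_i\nu_\alpha$, and the corresponding characteristic vector is $x_i \otimes y_\alpha = (x_{1i}y_\alpha', \ldots, x_{pi}y_\alpha')'$, $i = 1, \ldots, p$, $\alpha = 1, \ldots, q$.

Proof.

(24) $(A \otimes B)(x_i \otimes y_\alpha) = \begin{pmatrix} a_{11}B & \cdots & a_{1p}B \\ \vdots & & \vdots \\ a_{p1}B & \cdots & a_{pp}B \end{pmatrix}\begin{pmatrix} x_{1i}y_\alpha \\ \vdots \\ x_{pi}y_\alpha \end{pmatrix} = \begin{pmatrix} \sum_j a_{1j}x_{ji}By_\alpha \\ \vdots \\ \sum_j a_{pj}x_{ji}By_\alpha \end{pmatrix} = \begin{pmatrix} \lambda_i x_{1i}By_\alpha \\ \vdots \\ \lambda_i x_{pi}By_\alpha \end{pmatrix} = \lambda_i\nu_\alpha\begin{pmatrix} x_{1i}y_\alpha \\ \vdots \\ x_{pi}y_\alpha \end{pmatrix}$. ∎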
Theorem A.4.5.

(25) $|A \otimes B| = |A|^q\,|B|^p$.

Proof. The determinant of any matrix is the product of its roots; therefore

(26) $|A \otimes B| = \prod_{i=1}^{p}\prod_{\alpha=1}^{q} \lambda_i\nu_\alpha = \left(\prod_{i=1}^{p} \lambda_i^q\right)\left(\prod_{\alpha=1}^{q} \nu_\alpha^p\right)$. ∎

Definition A.4.2. If the $p \times m$ matrix $A = (a_1, \ldots, a_m)$, then $\operatorname{vec} A = (a_1', \ldots, a_m')'$.

Some properties of the vec operator [e.g., Magnus (1988)] are

(27) $\operatorname{vec} ABC = (C' \otimes A)\operatorname{vec} B$,
(28) $\operatorname{vec} xy' = y \otimes x$.
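Both vec identities, $\operatorname{vec} ABC = (C' \otimes A)\operatorname{vec} B$ and $\operatorname{vec} xy' = y \otimes x$, can be checked on small integer matrices. In the sketch below `vec` stacks columns, first column on top; all names and data are illustrative assumptions.

```python
def matmul(a, b):
    return [[sum(a[i][j] * b[j][k] for j in range(len(b)))
             for k in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def kron(a, b):
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def vec(a):
    """Stack the columns of a into a single list, first column first."""
    return [a[i][j] for j in range(len(a[0])) for i in range(len(a))]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# Illustrative matrices (assumed for this sketch).
A = [[1, 2], [3, 4]]               # 2 x 2
B = [[1, 0, 2], [0, 1, 1]]         # 2 x 3
C = [[1, 1], [0, 1], [2, 0]]       # 3 x 2

lhs = vec(matmul(matmul(A, B), C))
rhs = matvec(kron(transpose(C), A), vec(B))

# vec(x y') compared with y (x) x, writing x and y as column matrices.
x = [[1], [2]]
y = [[3], [4], [5]]
outer = matmul(x, transpose(y))    # 2 x 3 matrix x y'
vec_outer = vec(outer)
y_kron_x = [row[0] for row in kron(y, x)]
```

The column-stacking convention matters: with row stacking the roles of $A$ and $C'$ in the Kronecker product would be interchanged.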


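Theorem A.4.6. The Jacobian of the transformation $E = Y^{-1}$ (from $E$ to $Y$) is $|Y|^{-2p}$, where $p$ is the order of $E$ and $Y$.

Proof. From $EY = I$, we have

(29) $\left(\dfrac{\partial}{\partial\theta}E\right)Y + E\left(\dfrac{\partial}{\partial\theta}Y\right) = 0$,

where

(30) $\dfrac{\partial}{\partial\theta}E = \begin{pmatrix} \dfrac{\partial e_{11}}{\partial\theta} & \cdots & \dfrac{\partial e_{1p}}{\partial\theta} \\ \vdots & & \vdots \\ \dfrac{\partial e_{p1}}{\partial\theta} & \cdots & \dfrac{\partial e_{pp}}{\partial\theta} \end{pmatrix}$.

Then

(31) $\dfrac{\partial}{\partial\theta}E = -E\left(\dfrac{\partial}{\partial\theta}Y\right)E = -Y^{-1}\left(\dfrac{\partial}{\partial\theta}Y\right)Y^{-1}$.

If $\theta = y_{\alpha\beta}$, then

(32) $\dfrac{\partial}{\partial y_{\alpha\beta}}E = -E\varepsilon_{\alpha\beta}E = -e_{\cdot\alpha}e_{\beta\cdot}$,

where $\varepsilon_{\alpha\beta}$ is a $p \times p$ matrix with all elements 0 except the element in the $\alpha$th row and $\beta$th column, which is 1; and $e_{\cdot\alpha}$ is the $\alpha$th column of $E$ and $e_{\beta\cdot}$ is its $\beta$th row. Thus $\partial e_{ij}/\partial y_{\alpha\beta} = -e_{i\alpha}e_{\beta j}$. Then the Jacobian is the determinant of a $p^2 \times p^2$ matrix

(33) $\operatorname{mod}\left|\dfrac{\partial e_{ij}}{\partial y_{\alpha\beta}}\right| = \operatorname{mod}\left|-(e_{i\alpha}e_{\beta j})\right| = |E \otimes E'| = |E|^p\,|E'|^p = |E|^{2p} = |Y|^{-2p}$. ∎

Theorem A.4.7. Let $A$ and $B$ be symmetric matrices with characteristic roots $a_1 \geq a_2 \geq \cdots \geq a_p$ and $b_1 \geq b_2 \geq \cdots \geq b_p$, respectively, and let $H$ be a $p \times p$ orthogonal matrix. Then

(34) $\max_{H} \operatorname{tr} HAH'B = \sum_{j=1}^{p} a_jb_j$, $\quad \min_{H} \operatorname{tr} HAH'B = \sum_{j=1}^{p} a_jb_{p+1-j}$.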


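Proof. Let $A = H_a D_a H_a'$ and $B = H_b D_b H_b'$, where $H_a$ and $H_b$ are orthogonal and $D_a$ and $D_b$ are diagonal with diagonal elements $a_1, \ldots, a_p$ and $b_1, \ldots, b_p$ respectively. Then

(35) $\max_{H^*} \operatorname{tr} H^*AH^{*\prime}B = \max_{H^*} \operatorname{tr} H^* H_a D_a H_a' H^{*\prime} H_b D_b H_b'$
$= \max_{H^*} \operatorname{tr} H_b' H^* H_a D_a (H_b' H^* H_a)' D_b$
$= \max_{H} \operatorname{tr} H D_a H' D_b,$

where $H = H_b' H^* H_a$. We have

(36) $\operatorname{tr} H D_a H' D_b = \sum_{i=1}^{p} (H D_a H')_{ii} b_i$
$= \sum_{i=1}^{p-1} (b_i - b_{i+1}) \sum_{j=1}^{i} (H D_a H')_{jj} + b_p \sum_{j=1}^{p} (H D_a H')_{jj}$
$\le \sum_{i=1}^{p-1} (b_i - b_{i+1}) \sum_{j=1}^{i} a_j + b_p \sum_{j=1}^{p} a_j$
$= \sum_{i=1}^{p} a_i b_i$

by Lemma A.4.6 below. The minimum in (34) is treated as the negative of the maximum with B replaced by −B [von Neumann (1937)]. ∎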
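The bounds in (34) can be illustrated by sampling orthogonal matrices. The sketch below assumes NumPy and uses arbitrary random symmetric A and B; the orthogonal matrices are drawn from the QR decomposition of Gaussian matrices, which is only one convenient way to generate them. The maximizing choice $H^* = H_b H_a'$ corresponds to H = I in the notation of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
A = rng.standard_normal((p, p)); A = A + A.T
B = rng.standard_normal((p, p)); B = B + B.T
a = np.sort(np.linalg.eigvalsh(A))[::-1]   # a_1 >= ... >= a_p
b = np.sort(np.linalg.eigvalsh(B))[::-1]   # b_1 >= ... >= b_p
upper = float(a @ b)                       # sum_j a_j b_j
lower = float(a @ b[::-1])                 # sum_j a_j b_{p+1-j}

# tr(HAH'B) for many random orthogonal H.
traces = []
for _ in range(1000):
    H, _r = np.linalg.qr(rng.standard_normal((p, p)))
    traces.append(np.trace(H @ A @ H.T @ B))

# The maximum is attained when the ordered eigenvectors of A and B are aligned.
_, Ha = np.linalg.eigh(A)
_, Hb = np.linalg.eigh(B)
H_best = Hb @ Ha.T
attained = np.trace(H_best @ A @ H_best.T @ B)
```

Every sampled trace should lie in the interval from `lower` to `upper`, and `attained` should equal `upper` up to rounding.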
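Lemma A.4.6. Let $P = (p_{ij})$ be a doubly stochastic matrix ($p_{ij} \ge 0$, $\sum_{i=1}^{p} p_{ij} = 1$, $\sum_{j=1}^{p} p_{ij} = 1$). Let $y_1 \ge y_2 \ge \cdots \ge y_p$. Then

(37) $\sum_{i=1}^{k} y_i \ge \sum_{i=1}^{k} \sum_{j=1}^{p} p_{ij} y_j, \qquad k = 1, \ldots, p.$

Proof.

(38) $\sum_{i=1}^{k} \sum_{j=1}^{p} p_{ij} y_j = \sum_{j=1}^{p} g_j y_j,$

where $g_j = \sum_{i=1}^{k} p_{ij}$, $j = 1, \ldots, p$ ($0 \le g_j \le 1$, $\sum_{j=1}^{p} g_j = k$). Then

(39) $\sum_{j=1}^{p} g_j y_j - \sum_{j=1}^{k} y_j = -\sum_{j=1}^{k} y_j + y_k \Bigl(k - \sum_{j=1}^{p} g_j\Bigr) + \sum_{j=1}^{p} g_j y_j$
$= \sum_{j=1}^{k} (y_j - y_k)(g_j - 1) + \sum_{j=k+1}^{p} (y_j - y_k) g_j$
$\le 0.$ ∎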


Corollary A.4.2. Let A be a symmetric matrix with characteristic roots $a_1 \ge a_2 \ge \cdots \ge a_p$. Then

(40) $\max_{R'R = I_k} \operatorname{tr} R'AR = \sum_{i=1}^{k} a_i.$

Proof. In Theorem A.4.7 let

(41) $B = \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix}.$ ∎
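As an illustration of (40), the characteristic vectors belonging to the k largest roots attain the bound, while other matrices with orthonormal columns do not exceed it. A minimal sketch, assuming NumPy and an arbitrary random symmetric A:

```python
import numpy as np

rng = np.random.default_rng(4)
p, k = 5, 2
A = rng.standard_normal((p, p)); A = A + A.T
roots, vectors = np.linalg.eigh(A)          # roots in ascending order
bound = roots[::-1][:k].sum()               # a_1 + ... + a_k

R_top = vectors[:, ::-1][:, :k]             # vectors of the k largest roots
attained = np.trace(R_top.T @ A @ R_top)

# Random p x k matrices R with R'R = I_k, from a reduced QR decomposition.
trials = []
for _ in range(1000):
    R, _r = np.linalg.qr(rng.standard_normal((p, k)))
    trials.append(np.trace(R.T @ A @ R))
```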

Theorem A.4.8.
(42) $|I + xC| = 1 + x \operatorname{tr} C + O(x^2).$

Proof. The determinant (42) is a polynomial in x of degree p; the coefficient of the linear term is the first derivative of the determinant evaluated at x = 0. In Lemma A.4.5 let n = 1, $c_1 = x$, $\beta_{ih}(x) = \delta_{ih} + x c_{ih}$, where $\delta_{ii} = 1$ and $\delta_{ih} = 0$, $i \ne h$. Then $d\beta_{ih}(x)/dx = c_{ih}$, $B_{ii} = 1$ for x = 0, and $B_{ih} = 0$ for x = 0, $i \ne h$. Thus

(43) $\frac{d|B(x)|}{dx}\Big|_{x=0} = \sum_{i=1}^{p} c_{ii}.$ ∎
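A small worked case makes the expansion concrete. For the arbitrary 2 × 2 matrix C used below (an illustration, not from the text), $|I + xC| = 1 + 5x - 2x^2$, so the remainder after the linear term is exactly $-2x^2$. The sketch assumes NumPy.

```python
import numpy as np

C = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # tr C = 5
I = np.eye(2)

def remainder(x):
    """|I + xC| minus its linear approximation 1 + x tr C."""
    return np.linalg.det(I + x * C) - (1.0 + x * np.trace(C))

# The remainder is O(x^2): shrinking x by a factor 10 shrinks it by about 100.
r1, r2 = remainder(1e-2), remainder(1e-3)
ratio = r1 / r2
```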
A.5. GRAM-SCHMIDT ORTHOGONALIZATION AND THE
SOLUTION OF LINEAR EQUATIONS


A.5.1. Gram-Schmidt Orthogonalization


The derivation of the Wishart density in Section 7.2 included the Gram-Schmidt orthogonalization of a set of vectors; we shall review that development here. Consider the p linearly independent n-dimensional vectors $v_1, \ldots, v_p$ ($p \le n$). Define $w_1 = v_1$,

(1) $w_i = v_i - \sum_{j=1}^{i-1} \frac{v_i'w_j}{w_j'w_j} w_j, \qquad i = 2, \ldots, p.$

Then $w_i \ne 0$, $i = 1, \ldots, p$, because $v_1, \ldots, v_p$ are linearly independent, and $w_i'w_j = 0$, $i \ne j$, as was proved by induction in Section 7.2. Let $u_i = (1/\|w_i\|)w_i$, $i = 1, \ldots, p$. Then $u_1, \ldots, u_p$ are orthonormal, that is, they are orthogonal and of unit length. Let $U = (u_1, \ldots, u_p)$. Then $U'U = I$. Define $t_{ii} = \|w_i\|$ (> 0),

(2) $t_{ij} = \frac{v_i'w_j}{\|w_j\|} = v_i'u_j, \qquad j = 1, \ldots, i-1, \quad i = 2, \ldots, p,$

and $t_{ij} = 0$, $j = i+1, \ldots, p$, $i = 1, \ldots, p-1$. Then $T = (t_{ij})$ is a lower triangular matrix. We can write (1) as

(3) $v_i = \|w_i\| u_i + \sum_{j=1}^{i-1} (v_i'u_j) u_j = \sum_{j=1}^{i} t_{ij} u_j, \qquad i = 1, \ldots, p,$

that is,

(4) $V = (v_1, \ldots, v_p) = UT'.$

Then

(5) $A = V'V = TU'UT' = TT'$


as shown in Section 7.2. Note that if V is square, we have decomposed an 
arbitrary nonsingular matrix into the product of an orthogonal matrix and an 
upper triangular matrix with positive diagonal elements; this is sometimes 
known as the QR decomposition. The matrices U and T in (4) are unique. 
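Equations (1) through (5) translate directly into a short routine. The following is a minimal sketch of classical Gram-Schmidt, assuming NumPy; the function name `gram_schmidt` and the test matrix are hypothetical. Because T is lower triangular with positive diagonal and A = TT', it should coincide with the Cholesky factor returned by `np.linalg.cholesky`.

```python
import numpy as np

def gram_schmidt(V):
    """Return U (orthonormal columns) and lower triangular T with V = U T'."""
    n, p = V.shape
    W = np.zeros((n, p))
    for i in range(p):
        W[:, i] = V[:, i]
        for j in range(i):                        # equation (1)
            W[:, i] -= (V[:, i] @ W[:, j]) / (W[:, j] @ W[:, j]) * W[:, j]
    norms = np.linalg.norm(W, axis=0)             # t_ii = ||w_i||
    U = W / norms                                 # u_i = w_i / ||w_i||
    T = np.tril(V.T @ U)                          # t_ij = v_i'u_j for j <= i
    return U, T

rng = np.random.default_rng(6)
V = rng.standard_normal((6, 3))                   # p = 3 vectors in n = 6 dimensions
U, T = gram_schmidt(V)
A = V.T @ V
```

This classical form follows the text closely but can lose orthogonality in floating point when the columns of V are nearly dependent.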

These operations can be done in a different order. Let $V = (v_1^{(0)}, \ldots, v_p^{(0)})$. For $k = 1, \ldots, p-1$ define recursively

(6) $t_{kk} = \|v_k^{(k-1)}\|, \qquad u_k = \frac{1}{\|v_k^{(k-1)}\|} v_k^{(k-1)} = \frac{1}{t_{kk}} v_k^{(k-1)},$

(7) $t_{jk} = v_j^{(k-1)\prime} u_k, \qquad j = k+1, \ldots, p,$

(8) $v_j^{(k)} = v_j^{(k-1)} - t_{jk} u_k, \qquad j = k+1, \ldots, p.$

Finally $t_{pp} = \|v_p^{(p-1)}\|$ and $u_p = (1/t_{pp}) v_p^{(p-1)}$. The same orthonormal vectors $u_1, \ldots, u_p$ and the same triangular matrix $(t_{ij})$ are given by the two procedures.
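The reordered recursion (6) through (8) is what is usually called modified Gram-Schmidt. A minimal sketch, assuming NumPy; the function name and test matrix are hypothetical:

```python
import numpy as np

def modified_gram_schmidt(V):
    """Equations (6)-(8): at stage k, remove the u_k component from all later columns."""
    Vk = np.array(V, dtype=float)        # holds the current vectors v_j^(k)
    n, p = Vk.shape
    U = np.zeros((n, p))
    T = np.zeros((p, p))
    for k in range(p):
        T[k, k] = np.linalg.norm(Vk[:, k])        # (6): t_kk
        U[:, k] = Vk[:, k] / T[k, k]              # (6): u_k
        for j in range(k + 1, p):
            T[j, k] = Vk[:, j] @ U[:, k]          # (7): t_jk
            Vk[:, j] -= T[j, k] * U[:, k]         # (8): v_j^(k)
    return U, T

rng = np.random.default_rng(7)
V = rng.standard_normal((6, 3))
U, T = modified_gram_schmidt(V)
```

In exact arithmetic the result agrees with the first procedure; the final pass of the loop (k = p) has no later columns to update and simply yields $t_{pp}$ and $u_p$.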
The numbering of the columns of V is arbitrary. For numerical stability it is usually best at any given stage to select the largest of $\|v_i^{(k-1)}\|$ to call $t_{kk}$.
Instead of constructing $w_i$ as orthogonal to $w_1, \ldots, w_{i-1}$, we can equivalently construct it as orthogonal to $v_1, \ldots, v_{i-1}$. Let $w_1 = v_1$ and define

(9) $w_i = v_i + \sum_{j=1}^{i-1} f_{ij} v_j$

such that

(10) $0 = v_h'w_i = v_h'v_i + \sum_{j=1}^{i-1} f_{ij} v_h'v_j = a_{hi} + \sum_{j=1}^{i-1} a_{hj} f_{ij}, \qquad h = 1, \ldots, i-1.$

Let $F = (f_{ij})$, where $f_{ii} = 1$ and $f_{ij} = 0$, $i < j$. Then

(11) $W = (w_1, \ldots, w_p) = VF'.$

Let $D_t$ be the diagonal matrix with $\|w_j\| = t_{jj}$ as the jth diagonal element. Then $U = WD_t^{-1} = VF'D_t^{-1}$. Comparison with $V = UT'$ shows that $F = D_tT^{-1}$. Since $A = TT'$, we see that $FA = D_tT'$ is upper triangular. Hence F is the matrix defined in Theorem A.1.2.
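The relations among F, T, and $D_t$ can be verified numerically. The sketch below assumes NumPy and obtains T as the Cholesky factor of A = V'V (which equals the Gram-Schmidt T, since both are lower triangular with positive diagonal); the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
V = rng.standard_normal((6, 3))
A = V.T @ V
T = np.linalg.cholesky(A)            # lower triangular, A = T T'
D_t = np.diag(np.diag(T))            # diagonal matrix of t_jj = ||w_j||
F = D_t @ np.linalg.inv(T)           # F = D_t T^{-1}

W = V @ F.T                          # (11): columns w_1, ..., w_p
FA = F @ A                           # equals D_t T', hence upper triangular
```

F should be lower triangular with unit diagonal, and the columns of W should be mutually orthogonal with squared lengths $t_{jj}^2$.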

There are other methods of accomplishing the QR decomposition that may be computationally more efficient or more stable. A Householder matrix has the form $H = I_n - 2\alpha\alpha'$, where $\alpha'\alpha = 1$, and is orthogonal and symmetric. Such a matrix $H_1$ (i.e., a vector $\alpha$) can be selected so that the first column of $H_1V$ has 0 in all positions except the first, which is positive. The next matrix has the form

(12) $H_2 = \begin{pmatrix} 1 & 0 \\ 0 & I_{n-1} \end{pmatrix} - 2 \begin{pmatrix} 0 \\ \alpha \end{pmatrix} (0 \;\; \alpha') = \begin{pmatrix} 1 & 0 \\ 0 & I_{n-1} - 2\alpha\alpha' \end{pmatrix}.$

The (n − 1)-component vector $\alpha$ is chosen so that the second column of $H_2H_1V$ has all 0's except the first two components, the second being positive. This process is continued until

(13) $H_p \cdots H_2H_1V = \begin{pmatrix} T' \\ 0 \end{pmatrix},$

where T' is upper triangular and 0 is $(n - p) \times p$. Let

(14) $H' = H_1 \cdots H_p = (H^{(1)} \;\; H^{(2)}),$

where $H^{(1)}$ has p columns. Then from (13) we obtain $V = H^{(1)}T'$. Since the decomposition is unique, $H^{(1)} = U$.
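The Householder reduction (12) through (14) can be sketched as follows, assuming NumPy and a V of full column rank. The choice $\alpha \propto x - \|x\|e_1$ is one standard way to obtain a positive diagonal; production codes usually pick the sign differently to avoid cancellation and then rescale. The function name is hypothetical.

```python
import numpy as np

def householder_qr(V):
    """Reduce V (n x p) to (T'; 0) by p Householder reflections, as in (13)."""
    R = np.array(V, dtype=float)
    n, p = R.shape
    Q = np.eye(n)                          # accumulates H_1 H_2 ... H_k
    for k in range(p):
        x = R[k:, k]
        target = np.zeros_like(x)
        target[0] = np.linalg.norm(x)      # positive entry on the diagonal
        alpha = x - target
        norm_alpha = np.linalg.norm(alpha)
        if norm_alpha > 1e-14:             # otherwise the column already has the desired form
            alpha = alpha / norm_alpha
            Hk = np.eye(n)
            Hk[k:, k:] -= 2.0 * np.outer(alpha, alpha)
            R = Hk @ R
            Q = Q @ Hk
    return Q, R

rng = np.random.default_rng(9)
n, p = 6, 3
V = rng.standard_normal((n, p))
Q, R = householder_qr(V)
H1, T_prime = Q[:, :p], R[:p]              # H^(1) and T' in the notation of (13)-(14)
```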

Another procedure uses Givens matrices. A Givens matrix $G_{ij}$ is I except for the elements $g_{ii} = \cos\theta = g_{jj}$ and $g_{ij} = \sin\theta = -g_{ji}$, $i \ne j$. It is orthogonal. Multiplication of V on the left by such a matrix leaves all rows unchanged except the ith and jth; $\theta$ can be chosen so that the i, jth element of $G_{ij}V$ is 0. Givens matrices $G_{21}, \ldots, G_{n1}$ can be chosen in turn so $G_{n1} \cdots G_{21}V$ has all 0's in the first column except the first element, which is positive. Next $G_{32}, \ldots, G_{n2}$ can be selected in turn so that when they are applied the resulting matrix has 0's in the second column except for the first two elements. Let

(15) $G' = G_{21}' \cdots G_{n1}' G_{32}' \cdots G_{np}' = (G^{(1)} \;\; G^{(2)}).$

Then we obtain

(16) $V = G' \begin{pmatrix} T' \\ 0 \end{pmatrix} = G^{(1)}T',$

and $G^{(1)} = U$.
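A corresponding sketch with Givens rotations, assuming NumPy and n > p so that every column has at least one element below the diagonal. The sign convention for $\theta$ below is chosen so that each pivot becomes nonnegative; it may differ from the text's by the sign of $\theta$. The function name is hypothetical.

```python
import numpy as np

def givens_qr(V):
    """Zero the subdiagonal of V column by column with plane rotations."""
    R = np.array(V, dtype=float)
    n, p = R.shape
    G = np.eye(n)                           # accumulates the product of rotations
    for j in range(p):                      # column being cleared
        for i in range(j + 1, n):           # rotate rows j and i to zero element (i, j)
            r = np.hypot(R[j, j], R[i, j])
            if r == 0.0:
                continue
            c, s = R[j, j] / r, R[i, j] / r
            Gij = np.eye(n)
            Gij[j, j] = Gij[i, i] = c
            Gij[j, i] = s
            Gij[i, j] = -s
            R = Gij @ R
            G = Gij @ G
    return G, R

rng = np.random.default_rng(10)
n, p = 6, 3
V = rng.standard_normal((n, p))
G, R = givens_qr(V)
G1, T_prime = G.T[:, :p], R[:p]             # G^(1) and T' as in (15)-(16)
```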


A.5.2. Solution of Linear Equations


In the computation of regression coefficients and other statistics, we need to solve linear equations

(17) $Ax = y,$

where A is $p \times p$ and positive definite. One method of solution is Gaussian elimination of variables, or pivotal condensation. In the proof of Theorem A.1.2 we constructed a lower triangular matrix F with diagonal elements 1 such that $FA = A^*$ is upper triangular. If $Fy = y^*$, then the equation is

(18) $A^*x = y^*.$

In coordinates this is

(19) $\sum_{j=i}^{p} a_{ij}^* x_j = y_i^*.$

Let $a_{ij}^{**} = a_{ij}^*/a_{ii}^*$, $y_i^{**} = y_i^*/a_{ii}^*$, $j = i, i+1, \ldots, p$, $i = 1, \ldots, p$. Then

(20) $x_i = y_i^{**} - \sum_{j=i+1}^{p} a_{ij}^{**} x_j;$

these equations are to be solved successively for $x_p, x_{p-1}, \ldots, x_1$. The calculation of $FA = A^*$ is known as the forward solution, and the solution of (18) as the backward solution.
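The forward and backward solutions can be written out in a few lines. This is a minimal sketch, assuming NumPy and a positive definite A so that no pivoting is needed; the function names and the example system are hypothetical.

```python
import numpy as np

def forward_solution(A, y):
    """Gaussian elimination without pivoting: returns A* = FA and y* = Fy."""
    A_star = np.array(A, dtype=float)
    y_star = np.array(y, dtype=float)
    p = len(y_star)
    for k in range(p - 1):
        for i in range(k + 1, p):
            m = A_star[i, k] / A_star[k, k]   # pivot is positive for positive definite A
            A_star[i, k:] -= m * A_star[k, k:]
            y_star[i] -= m * y_star[k]
    return A_star, y_star

def backward_solution(A_star, y_star):
    """Solve (18) successively for x_p, x_{p-1}, ..., x_1, as in (20)."""
    p = len(y_star)
    x = np.zeros(p)
    for i in range(p - 1, -1, -1):
        x[i] = (y_star[i] - A_star[i, i + 1:] @ x[i + 1:]) / A_star[i, i]
    return x

rng = np.random.default_rng(11)
M = rng.standard_normal((7, 4))
A = M.T @ M                                   # positive definite example
y = rng.standard_normal(4)
A_star, y_star = forward_solution(A, y)
x = backward_solution(A_star, y_star)
```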

Since $FAF' = A^*F' = D^2$ diagonal, (20) is $A^{**}x = y^{**}$, where $A^{**} = D^{-2}A^*$ and $y^{**} = D^{-2}y^*$. Solving this equation gives

(21) $x = A^{**-1}y^{**} = F'y^{**}.$

The computation is

(22) $x = F_1' \cdots F_{p-1}' D^{-2} F_{p-1} \cdots F_1 y.$

The multiplier of y in (22) indicates a sequence of row operations which yields $A^{-1}$.

The operations of the forward solution transform A to the upper triangu- 
lar matrix A*. As seen in Section A.5.1, the triangularization of a matrix can 
be done by a sequence of Householder transformations or by a sequence of 
Givens transformations. 

From $FA = A^*$, we obtain

(23) $|A| = \prod_{i=1}^{p} a_{ii}^*,$

which is the product of the diagonal elements of $A^*$, resulting from the forward solution. We also have

(24) $y'A^{-1}y = (Fy)'D^{-2}(Fy) = y^{*\prime}D^{-2}y^* = y^{*\prime}y^{**}.$

The forward solution gives a computation for the quadratic form which occurs in $T^2$ and other statistics.
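Both by-products of the forward solution, the determinant (23) and the quadratic form (24), can be read off from the eliminated system. A minimal sketch, assuming NumPy and carrying y along as an extra column during elimination; the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(12)
p = 4
M = rng.standard_normal((p + 3, p))
A = M.T @ M                      # positive definite with probability 1
y = rng.standard_normal(p)

# Forward solution on the augmented matrix (A, y): row operations give (A*, y*).
aug = np.column_stack([A, y])
for k in range(p - 1):
    for i in range(k + 1, p):
        aug[i, k:] -= aug[i, k] / aug[k, k] * aug[k, k:]
A_star, y_star = aug[:, :p], aug[:, p]

pivots = np.diag(A_star)                        # a*_ii, the diagonal of D^2
det_A = np.prod(pivots)                         # (23)
quad_form = np.sum(y_star ** 2 / pivots)        # (24): y*' D^{-2} y*
```

No explicit inverse of A is formed; the quadratic form comes directly from $y^*$ and the pivots.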


For more on matrix computations consult Golub and Van Loan (1989).


APPENDIX B 


Tables 


TABLE B.1
WILKS' LIKELIHOOD CRITERION: FACTORS C(p, m, M)
TO ADJUST TO $\chi^2_{pm}$, WHERE M = n − p + 1


5% Significance Level 


1 | 1295 1.422 1535 1.791 
2 | 1109 1174 1241 1302 1359 1410 1458 150 150 
3 | 1058 109 1145 1190 12320 1272 130 134 137 
4 | 1036 1065 1099 1133 1167 1199 1229 1258 1286 
5 | 1025 1046 107 110 1127 1154 1179 124 1228 
6 | 1018 1035 1056 1078 1101 112 1145 1167 1188 
7 | 1014 1027 104 1063 1082 1100 1121 1139 1158 
8 | 101 1022 1036 1052 1068 1085 1102 119 1135 
9 | 100 1018 1030 1048 1058 103  L08 1102 1117 
10 | L007 1015 1025 1037 1050 1069 1076 1089 1103 
12 | 1005 1011 1019 1028 1038 1048 1059 1070 1081 
15 | 1003 1008 1013 1000 1027 1035 104 1052 100 
20 | 1000 1004 1008 1012 1017 102 1028 1034 1040 
30 | 1001 1002 1004 106 100 101 105 108 107 
60 | 1.000 1001 100 1002 100 1003 104 106 1007 
e | 1000 ` 1000 à 100 100 100 100 100 100 100 
$\chi^2_{pm}$ | 12.5916 21.0261 28.8693 36.4150 43.7730 50.9985 58.1240 65.1708 72.1532


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 
TABLE B.1 (Continued)


5% Significance Level


10 12 14 


1 2.021 2.067 1.407 1.451 1.517 1.583 1.644 1.700 1.751 
2 1.580 1.616 1.161 1.194 1.240 1.286 1.331 1.373 1.413 
3 1.408 1.438 1.089 1.114 1.148 1.183 1.218 1.252 1.284 
4 1.313 1.338 1.057 1.076 1.102 1.130 1.159 1.186 1.213 
5 1.251 1.040 1.055 1.076 1.099 1.122 1.145 1.168 
6 1.208 1.030 1.042 1.059 1.078 1.097 1.118 1.137
7 1.176 1.193 1.023 1.033 1.047 1.063 1.080 1.097 1.115
8 1.151 1.167 1.018 1.027 1.038 1.052 1.067 1.082 1.097
9 1.132 1.147 1.015 1.022 1.032 1.044 1.057 1.070 1.084
10 1.116 1.012 1.018 1.027 1.038 1.049 1.061 1.073
12 1.092 1.009 1.014 1.020 1.029 1.038 1.047 1.058
15 1.069 1.078 1.006 1.009 1.014 1.020 1.027 1.035 1.042
20 1.046 1.052 1.003 1.006 1.009 1.013 1.017 1.022 1.027
30 1.025 1.029 1.002 1.003 1.004 1.006 1.009 1.011 1.014
60 1.008 1.009 1.000 1.001 1.001 1.002 1.003 1.003 1.004
∞ 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
$\chi^2_{pm}$ 79.0819 15.5073 26.2962 36.4150 46.1943 55.7585 65.1708 74.4683


TABLE B.1 (Continued) 


5% Significance Level 


12 


1 1.799 1.843 1.884 1.503 1.483 1.514 1.556 1.600 1.643
2 1.450 1.485 1.518 1.209 1.216 1.245 1.280 1.315 1.350
3 1.314 1.343 1.371 1.120 1.130 1.154 1.182 1.211 1.240
4 1.239 1.264 1.288 1.079 1.089 1.108 1.131 1.155 1.179
5 1.190 1.233 1.056 1.065 1.081 1.100 1.120 1.141 
6 1.157 1.176 1.194 1.042 1.050 1.063 1.079 1.097 1.114 
7 1.132 1.149 1.165 1.033 1.040 1.051 1.065 1.080 1.095 
8 1.113 1.128 1.143 1.026 1.032 1.042 1.054 1.067 1.081 
9 1.098 1.111 1.125 1.022 1.027 1.035 1.046 1.057 1.070 
10 1.086 1.098 1.110 1.018 1.023 1.030 1.039 1.050 1.061 
12 1.068 1.078 1.088 1.013 1.017 1.023 1.030 1.038 1.047 
15 1.050 1.058 1.066 1.009 1.011 1.016 1.021 1.028 1.034 
20 1.033 1.039 1.045 1.005 1.007 1.010 1.013 1.018 1.022 
30 1.018 1.021 1.024 1.002 1.003 1.005 1.007 1.009 1.012 
60 1.005 1.007 1.008 1.001 1.001 1.001 1.002 1.003 1.004 
∞ 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
$\chi^2_{pm}$ 83.6753 92.8083 101.879 | 18.3070 31.4104 43.7730 55.7585 67.5048 79.0819
TABLE B.1 (Continued)


5% Significance Level 


1 1.587 1.520 1543 1.57 1.605 | 1.662 1.550 
2 1254 1.255 1279 1307 1335 | 1297 1263 
3 1.150 1.163 1184 1208 1.232 | 1178 1165 
4 1100 1116 1134 1154 1175 | 1121 1.116 
5 1.072 1.088 1103 1120 1138 | 1089 1.087 
6 1.055 1.069 1082 1.007 1113 | 1068 1.068 
7 1.043 1.056 1068 1.081 1.095 | 1.054 1.055 
8 1.035 1.046 1.057 1.068 1.081 1044 1045 
9 | 1.082 1.095 1.029 1.039 1048 1.059 1.070 | 1.036 1.038 
10 | 1.072 1083 | 1.024 1.034 1042 1051 1.061 | 101 1.0 
12 | 1.057 1.066 1.018 1.025 1032 1.040 1.048 | 1.023 1.024 
15 | 1.042 1.049. | 1.012 1.018 1.023 1.029 1.035 | 1.016 107 
20 | 1.027 1.033 1.007 1.011 1014 1.018 1.023 | 1.010 Lon 
30 | 1.014 1.018 | 1.003 1.006 1007 1010 1012 | 1.005 1005 
60 | 1.004 1006 | 1.001 1.002 1002 1003 1004 | 1.001 1.001 
со | 1000. 1.000 | 1.000 100 1000 1.000 1.000 | 1.000 . 


21.0261 50.9985 65.1708 79.0819 92.8083 | 23.6848 41.3371 


xi, |90.5312 101.879 


TABLE B.1 (Continued) 


5% Significance Level




20 


83.6753 | 28.8693 50.9985 72.1532} 31.4104 


1.000 
1.000 


16.8119 


2.158 
1.616 
1.420 
1.315 
1.249 


1.204 
117 
1.146 
1.127 
1.111 


1.087 
1.065 
1.043 
1.023 
1.007 
1.000 


81.0688 88.3794 95.6257 


4 


1.514 
1.207 
1.116 
1.076 
1.054 


1.040 
1.031 
1.025 
1.021 
1.017 


1.012 
1.009 
1.005 
1.002 
1.001 
1.000 


2.216 
1.657 
1.453 
1.344 
1.274 


1.226 
1.190 
1.163 
1.142 
1.125 


1.099 
1.074 
1.049 
1.027 
1.009 
1.000 


TABLE B.1 (Continued)
1% Significance Level 


6 


p = 3
8 10 12 14 16 


1.649 1.763 1.862 1.949 2.026 2.095 
1.282 1.350 1.413 1.470 1.523 1.571 
1167 1.216 1.262 1.306 1.346 1.384 
1113 1.150 1187 1.22] 1.254 1.285 
1.082 1.112 1.141 1.170 1.198 1.224 


1.063 1.087 1.112 1.136 1.159 1.182 
1.050 1.070 1.091 1.111 1.132 1.152 
1.041 1.058 1.075 1.093 1.111 1.129 
1.034 1.048 1,064 1.080 1.095 1111 
1.028 1.041 1.055 1.069 1.082 1.097 


1.021 1.031 1.042 1.053 1.064 1.076 
1.014 1.021 1.030 1.038 1.047 1.056 
1.009 1.013 1.019 1.024 1.030 1.036 
1.004 1.007 1.009 1.012 1.016 1.019 
1.001 1.002 1.003 1.004 1.005 1.006 
1.000 1.000 1.000 1.000 1.000 1.000 


26.2170 34.8053 42.9798 50.8922 58.6192 66.2062 73.6826 


2.269 


TABLE B.1 (Continued)


1% Significance Level 


1.550 


1.628 1.704 1.774 


1.696 1.192 1.229 1.279 1.330 1.379 
1.485 1.106 1132 1.168 1.207 1.244 
1.371 1.068 1.088 1.115 1.146 1.176 
1.297 1.047 1.063 1.085 1.109 1.134 
1.246 1.035 1.048 1.066 1.086 1.107 
1.209 1.027 1.037 1.052 1.070 1.088 
1.180 1.021 1.030 1.043 1.058 1.073 


1.157 


1.017 1.025 1.036 1.048 1.062 


1.139 1.014 1.021 1.030 1.041 1.054 
1110 1.010 1.015 1.023 1.031 1.041 
1.083 1.007 1.010 1.016 1.022 1.029 
1.056 1.004 1.006 1.010 1.014 1.019 
1.031 1.002 1.003 1.005 1.007 1.009 
1.010 1.000 1.001 1.001 1.002 1.003 
1.000 1.000 1.000 1.000 1.000 1.000 


20.0902 31.9999 42.9798 53.4858 63.6907 


TABLE B.1 (Continued) 


1% Significance Level




1 1.672 1.721 1.768 1.813 1.855 1.707 1.631 1.656 
2 1.321 1.359 1.396 1.431 1.465 1.300 1.294 1.319 
3 1.204 1.235 1.265 1.294 1.323 1.175 1.183 1.205 
4 1.145 1.171 1.196 1.221 1.245 1.116 1.129 1.148 
5 1.110 1.131 1.153 1.174 1.196 1.084 1.097 1.113 
6 1.087 1.105 1.124 1.143 1.161 1.063 1.076 1.090 
7 1.071 1.087 1.103 1.119 1.136 1.050 1.061 1.074 
8 11.059 1.073 1.087 1.102 1.116 1.040 1.051 1.062 
9 1.050 1.062 1.075 1.088 1.101 1.033 1.043 1.052 
10 1.043 1.054 1.065 1.077 1.089 1.028 1.037 1.045 
12 1.033 1.041 1.051 1.060 1.070 | 1.021 1.028 1.035 
15 1.023 1.030 1.037 1.044 1.052 1.014 1.020 1.024 
20 1.015 1.019 1.024 1.029 1.034 | 1.008 1.012 1.015 
30 1.007 1.010 1.012 1.015 1.019 1.004 1.006 1.008 
60 1.002 1.003 1.004 1.005 1.006 | 1.001 1.002 1.002 
00 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 
Xin 63.6907 76.1539 88.3794 100.425: 112.329) | 26.2170 58.6192 73.6826 
1 1.687
2 1.348 
3 1.230 
4 1.169 
5 1.131 
6 1.106 
7 1.087 
8 1.074 
9 1.063 
10 1.055 
12 1.042 
15 1.030 
20 1.019 
30 1.010 
60 1.003 
oo 1.000 
Xin 


M\m 
1 1.879 
2 1.394 
3 1.238 
4 1.163 
5 1.120 
6 1.092 
7 1.074 
8 |. 1.060 
9 1.050 
10 1.043 
12 1.032 
15 1.022 
20 1.013 
30 1.007 


60 1.002 
00 1.000 


TABLE B.1 (Continued)


1.722 1.797 
1.378 1,348 
1.255 1.207 
1.191 1.140 
1.102 
1.078 
1102 1.062 
1.086 1.050 
1.075 1.042 
1.035 
1.026 
1.037 1.018 
1.024. 1.011 
1.013 1.005 
1.004 1.001 
1.000 1.000 


88.3794 102.816 


1% Significance Level 


1.667 
1.305 
1.188 
1.130 
1.097 


1.076 
1.061 
1.050 
1.042 
1.036 


1.027 
1.019 
1.012 
1.006 
1.002 
1.000 


TABLE B.1 (Continued)


1.646 
1.326 
1.215 
1.158 
1.123 


1.099 
1.082 
1.069 


1.059 
1.051 


1.040 
1.028 
1.018 
1.009 
1.003 
1.000 


31.9999 93.2168 


1.642 
1.306 
1.194 
1.138 
1.105 


1.083 
1.067 
1.056 
1.047 
1.041 


1.031 
1.022 
1.014 
1.007 
1.002 
1.000 


1% Significance Level 


1.648 
1.321 
1.210 
1.152 
1.117 


1.094 
1.077 
1.065 
1.055 
1.048 


1.037 
1.025 
1.017 
1.00% 
1.03 
1.000 


р= 9 р = 10 

2 4 6 2 
1.953 1.740 1.671 2.021 
1.436 1.355 1.333 1.476 
1.267 1.226 1.218 1.296 
1.185 1161 1.158 1.207 
1.138 1.122 1.122 1.155 
1107 . 1.096 1.098 1.121 
1.086 1.078 1.080 1.098 
1.070 1.065 1.067 1.081 
1.059 1.055 1.058 1.068 
1.050 1.047 1.050 1.058 
1.038 1.036 1.037 1.044 
1.026 1.026 1.027 1.031 
1.016 1.016 1.017 1.019 
1.008 1.008 1.009 1.010 
1.002 1.002 1.003 1.003 
1.000 1.000 1.000 1.000 
34.8053 58.6192 81.0688 |37 5662 


10 


1.665 
1.342 
1.229 
1.169 
1.132 


1.107 
1.089 
1.075 
1.065 
1.056 


1.044 
1.032 
1.020 
1.011 
1.003 
1.000 


29.1412 48.2782 66.2062 83.5134 100.425 


TABLES OF SIGNIFICANCE POINTS FOR THE LAWLEY-HOTELLING TRACE TEST
$\Pr\{(n/m)\,W \ge x_\alpha\} = \alpha$


2 
3 | 58.428 
4 | 23.999 
5 | 15.639 
6 | 12.475 
7 | 10.334 
8 | 9207 
10 | 7.909 
12 | 7.190 
14 | 6.735 
18 | 6.193 
20 | 6.019 
25 | 5.724 
30 | 5.540 
35 | 5.414 
40 | 5.322 
50 | 5.198 
60 | 5.118 
70 | 5.062 
80 | 5.020 
100 | 4.963 
200 | 4.851 
oo 4.744 


*Multiply by 10².


10.659* 
58.915 
23.312 
14.864 
11.411 
9.594 
8.488 


7.224 
6.528 
6.090 
5.571 
5.405 


5.124 
4.949 
4.829 
4.742 


4.625 
4.549 
4.496 
4.457 


4.403 
4.298 


4.197 


11.098* 
59.161 
22.918 
14.422 
10.975 
9.169 
8.075 


6.829 
6.146 
5.717 
5.209 
5.047 


4.774 
4.604 
4.488 
4.404 


4.290 
4.217 
4.165 
4.127 


4.075 
3.974 


3.877 


TABLE B.2


11.373* 
59.308 
22.663 
14.135 
10.691 
8.893 
7.805 


6.570 
5.894 
5.470 
4.970 
4.810 


4.542 
4.374 
4.260 
4.178 


4.066 
3.994 
3.944 
3.907 


3.856 
3.157 


3.661 


p = 2
6 
11.562* 
59.407 
22.484 
13.934 
10.491 
8.697 
7.614 


6.386 
5.715 
5.294 
4.798 
4.640 


4.374 
4.209 
4.096 
4.014 


3.904 
3.833 
3.783 
3.147 


3.696 
3.598 


3.504 


5% Significance Level 


11.804* 
59.531 
22.250 
13.670 
10.228 
8.440 
7.361 


6.141 
5.474 
5.057 
4.566 
4410 


4.147 
3.983 
3.872 
3.791 


3.682 
3.611 
3.562 
3.526 


3.476 
3.380 


3.287 


11.952* 
59.606 
22.104 
13.504 
10.063 
8.277 
7.201 


5.984 


| 5.320 


4.905 
4.416 
4.260 


3.998 
3.835 
3.724 
3.643 


3.535 
3.465 
3.416 
3.380 


3.330 
3.234 


3.141 


12.052* 
59.655 
22.003 
13.391 
9.949 
8.164 
7.090 


5.875 
5.212 
4,798 
4.309 
4.154 


3.892 
3.729 
3.618 
3.538 


3.429 
3.359 
3.310 
3.274 


3.224 
3.127 


3.035 


12.153* 
59.705 
21.901 
13.275 
9.832 
8.048 
6.975 


5.761 
5.100 
4.686 
4.198 
4.042 


3.780 
3.617 
3.505 
3.425 


3.315 
3.245 
3.196 
3.159 


3.109 
3.012 


2.918 
TABLE B.2 (Continued)


1% Significance Level 


5% Significance Level 


p=3 
6 8 10 


р= 2 


п\т| 2 3 4 5 6 8 10 12 15 5 12 15 20 


2467! 2.667! 2.776! 2.844! 2.891 2.952! 2.9891 3.014! 3.039! 
2.985* 2.990*  2.992* 2.994"  2.995* 2.996* 2.997* 2.997* 2.998* 
74.275 71.026 69244 68116 67.337 66.332 65.712 65.290 64.862 


2 25.930* 26.996* 27.665* 28.125* 28.712* 29.073* 29.316* 29.561* 29.809: 
3 

4 

5 | 38.295 35.567 34.070 33.121 32.465 31.615 31.088 30.729 30.364 

6 

7 

8 


1188* 1.193"  1196*  1.198* 1.200" 1.202*  1.203*  1204* 1205' 
42474 41.764 41.305 40.983 40.562 40.300 40.120 39.937 39.750 
25.456 24.715 24235 23.899 23.458 23.182 22.992 22.799 22.600 
18.752 18.056 17.605 17.288 16.870 16608 16.427 16.241 16.051 
15.08 14.657 14.233 13.934 13.540 13.290 ' 13.118 12.941 12.758 


26.1048 23.794 22.517 21.706 21.143 20.413 19.958 19.648 19.332 
20.388 18.326 17.191 16.469 15.967 15.313 14.905 14.626 14.341 
17.152 15.268 14.229 13.567 13.106 12.504 12.127 11.868 11.603 


10 | 11893 - 11.306 10.921 10.649 10.287 10.057 9.897 9.732 9.560 
10 113.701 12.038 11120 10.531 10121 9.582 9243 901) 8.769 12 | 10229 9.682 9323 9.068 8727 8.509 8.357 8198 8.033 
12 11.920 10.388 9.541 8.996 8.615 8113 7.796 7.577 7351 14 | 9255 - 8.736 8.394 8149 7.822 7.612 7465 7.311 7150 
14 [10.844 9.399 8597 8.082 7.720 7242 6939 6.729 6.511 16| 8.618 8118 7.788 7.553 7236 7031 6.887 6.736 6.577 
18 | 9617 8278 7.533 7.053 6.714 6.265 5.979 5.780 5.572 18 | 8170 . 7.685 7.364 7135 6.825 6.624 6483 6.334 6177 
20 | 9.236 7.932 7206 6.736 6.406 5.966 5.685 5.489 35.284 20| 7838 7365 7.051 6826 6.522 6325 6.185 6.038 5.882 
25 | 8.604 7.360 6666 6217 5.899 5.476 5.204 5.013 4.813 25 | 7294 6.841 6.539 6323 6.029 5.837 5.700 5.556 5.401 
30 | 8219 7013 6.339 5.903 5.593 5180 4914 4726 4.529 30| 6965 6524 6231 6020 5.32 5.543 5409 5.265 5.112 
35 | 7.959 6.780 6.120 5.692 5.389 4982 4720 4535 4339 35 | 6745 6.313 6025 5818 5.534 5.348 5214 5.072 4919 
40 | 7.773 6613 $5964 5.542 5.243 4841 4582 4398 4204 40 | 6.588 6162 5.878 5.673 5.393 5.208 5.076 4934 4781 
50 | 7.523 6389 5.754 5.341 5.048 4653 4397 4216 4023 50 | 6377 5961 5.682 5481 5.205 5022 4891 4750 4.597 
60 | 7363 6.247 5.621 5.214 4.924 4534 4280 4100 3.908 60 | 6243 5832 5.558 5.359 5.086 4.904 4.774 4633 4.480 


70 | 7.252 6.148 5.529 5.125 4.838 4.451 4.199 4020 3.829 


6.150 5.744 5.471 5.274 5.003 4.823 4.693 4.553 4.399 
80 | 7.171 6.075 5.461 5.061 4.715 4.391 4.140 3.961 3.770 


6.082 5.679 5.408 5.212 4.943 4.763 4.634 4.493 4.339 
100 | 7.059 5.976 5.369 4.972 4.690 4.308 4.059 3.881 3.691 


5.989 5.590 5.322 5.128 4.860 4.682 4.552 4.413 4.258 
200 | 6.843 5.785 5.191 4.803 4.525 4.150 3.903 3.127 3.538 


5.810 5.419 5.156 4.965 4.702 4.525 4.397 4.257 4102 


5.640 5.256 4.999 4.812 4.552 4.377 4.250 4110 3.954 
*Multiply by 10².


oo 6.638 5.604 5.023 4.642 4.369 4.000 3.751 3.582 3.393 


†Multiply by 10⁴.
*Multiply by 10².




59.507 
37.994 
28.308 


19.737 
15.973 
13.905 
12.610 
11.729 
11.091 


10.075 
9.479 
9.087 
8.811 
8.448 
8.220 
8.063 
7.948 


7.793 
7.498 


7.222 


†Multiply by 10⁴.
*Multiply by 10².




6.484! 
5.990* 
1.274* 


4 


6.7501 
5.995* 
1.242* 


57.032 
35.993 
26.599 


18.355 
14.765 
12.803 
11,581 
10.751 
10.152 


9.201 
8.644 
8.280 
8.023 
7.686 
7.474 
7.329 
7.224 


7.081 
6.808 


6.554 


TABLE B.2 (Continued) 


1% Significance Level 


p=3 
5 6 8 10 12 15 20 
6.917" 7031! 71781 7267! 7328 7389! 74511 
5.998* 6.000* 6.002*  6.003*  6.005* 6.006* 6.007* 
1.2227 1.08* 1.190* 1179* 1172* 1164* 1.156* 
55.462 54.377 52.973 52102 51.509 50.906 50.292 
34.721 33.840 32.695 31.984 31498 31.002 30.496 
25.511 24.7555 23.771 23157 22.737 22.308 21.868 
17.471 16.855 16.050 15.544 15197 14840 14472 
13.990 13.448 12.7377 12288 11978 11.659 11328 
12.096 11.599 10.945 10.530 10.243 9946 9638 
10.918 10.452 9.836 944+ 9172 8.890 8.596 
10.120 9.676 9.087 8.712 8450 8178 17.893 
9.545 9117 8.549 8186 7.932 7668 7.390 
8.634 8.233 7.6989 7.356 7115 6.802 6.596 
8.102 7718 7205 6874 6641 6395 6135 
7.755 7.382 6.88 6.560 6.332 6091 5.834 
7.511 7146 6.650 6.339 6115 587 5.623 
7189 6.836 6360 6050 5.831 5.597 5.346 
6.988 6.642 6.174 5.870 5653 5422 5172 
6.850 6.509 6.047 5.746 5.531 5.302 5.053 
6.750 6412 5.955 5.6586 5443 5215 4967 
6.614 6.281 5.830 5.534 5.323 5096 4850 
6.356 6.032 5.593 5304 5.096 4873 4.627 
6.116 5.801 5.373 5.089 4885 4664 4.419 


4 
5 | 1.996* 
6 165.715 
7 | 37.343 
8 | 26.516 
10 | 17.875 
12 | 14.338 
14 | 12.455 
16 | 11.295 
18 | 10.512 
20 | 9.950 
25 | 9.059 
30 | 8.538 
35 | 8.197 
40 | 7.957 
50 | 7.640 
60 | 7.442 
70 | 7.305 
80 | 7.206 
100 | 7.071 
200 | 6.814 
© 6.574 
*Multiply by 10².


51.204* 
2.001* 
64.999 
36.629 
25.868 


17.326 
13.848 
12.002 
10.868 
10.104 

9.556 


8.688 
8.182 
7.852 
7.619 
7.313 
7.120 
6.988 
6.892 


6.762 
6.514 


6.282 


TABLE B.2 (Continued) 


5% Significance Level 


р=4 
10 


12 


20 


25 


52.054* 53.142“ 53.808* 54.258* 5471*  5517* 55.46* 
2005* 2.009* 2.011* 2013* 2.015* 2.016* 2.017* 
64.497 63.841 63.432 63.151 62.866 62.573 62.396 
36129 35.474 35.064 34.782 34.495 34.200 34.019 
25.413 24.814 24.437 24178 23.912 23.639 23.471 
16.938 16.424 16.098 15.872 15.640 15.399 15,250 
13.500 13.037 12.741 12.535 12.321 12.099 11.961 
11.680 11.248 10.972 10.778 10.577 10.366 10.234 
10.563 10.154 9.890 9.705 9.512 9.309 9.181 
9.812 9.419 9.165 8.986 8.798 8.600 8.475 
9.274 8.893 8.645 8.471 8.287 8.093 7.970 
$.422 8.062 7.826 7.639 7.482 7.293 17.173 
7.927 7.578 7.350 7.188 7.015 6.829 6.710 
7.603 7.263 7.040 6.880 6.710 6.526 6.408 
7.375 7.041 6.821 6.664 6.495 6.313 6195 
7.075 6.750 6.535 6.380 6.214 6.033 5.916 
6.887 6.568 6.356 6.203 6.038 5.858 | 5.740 
6.758 6.443 6.232 6.081 5.917 5.738 5.620 
6.665 6.351 6.143 5.992 5.829 5.650 5.532 
6.537 6.228 6.021 5.872 5.710 5.531 5413 
6.295 5.993 5.191 5.644 5.484 5.305 5.186 
6.069 5.714 5.576 5.431 5.272 5.094 4.974 




п\т 4 


4 | 12.491! 


9.999* 

1.938* 
85.053 
51.991 


won GU 


10 | 29.789 
12 | 21.965 
14 | 18.142 
16 | 15.916 
18 | 14.473 
20 | 13.466 


25 | 11.924 
30 | 11.055 
35 | 10.499 
40 | 10.114 


50 | 9.614 
60 | 9.305 
70 | 9.095 
80 | 8.944 


100 | 8.739 
200 | 8.354 
oo 8.000 


†Multiply by 10⁴.
*Multiply by 10².




TABLE B.2 (Continued) 


5 


12.800* 
10.004* 
1.906* 
82.731 
50.178 


28.478 
20.889 
17.199 
15.059 
13.674 
12.710 


11.237 
10.409 
9.880 
9.514 


9.040 
8.747 
8.549 
8.405 


8.211 
7.848 
7.513 


6 


13.012! 
10.008* 
1.885* 
81.125 
48.921 


27.566 
20.138 
16.539 
14.457 
13.112 
12.177 


10.751 
9.951 
9.440 
9.087 


8.631 
8.319 
8.158 
8.020 


7.833 
7.484 
7.163 


1% Significance Level 


8 


13.283! 
10.012* 
1.857* 
79.047 
47.290 


26.376 
19.154 
15.670 
13.662 
12.368 
11.470 


10.103 
9.338 
8.851 
8.514 


8.079 
7.311 
7.630 
7.498 


7.321 
6.990 
6.686 


р=4 
10 


13.449! 
10.014* 
1.840* 
71.159 
46.276 


25.632 
18.534 
15.121 
13.157 
11.894 
11.018 


9.687 
8.943 
8.470 
8.142 


7.720 
7.460 
7.284 
7.157 


6.985 
6.664 
6.369 


12 


13.561" 
10.016* 
1.828* 
76.882 
45.583 


25.121 
18.108 
14.742 
12.807 
11.564 
10.703 


9.395 
8.665 
8.200 
7.879 


7.465 
7.210 
7.037 
6.912 


6.744 
6.429 
6.140 


15 


` 13.671 


10.018* 
1.816* 

75.989 

44.877 


24.597 
17.668 
14.349 
12.444 
11.221 
10.374 


9.089 
8.372 
7.915 
7.600 


7.194 
6.943 
6.774 
6.651 


6.486 
6.176 
5.892 


20 


13.79! 
10.02* 
1.804* 
75.082 
44.156 


24.060 
17.215 
13.943 
12.066 
10.863 
10.030 


8.766 
8.060 
7.611 
7.301 


6.902 
6.655 
6.488 
6.367 


6.204 
5.898 
5.616 


25 


13.87! 
10.02* 
1.797* 
74.522 
43.715 


23.731 
16.936 
13.691 
11.831 
10.639 

9.814 


8.562 
7.863 
7.418 
7.110 


6.713 
6.468 
6.301 
6.181 


6.019 
5.714 
5.432 


81.991* 
3.009* 

93.762 

51.339 


27.667 
20.169 
16.643 
14.624 
13.326 
12.424 


11.046 
10.270 
9.774 
9.429 


8.982 
8.706 
8.517 
8.381 


8.197 
7.850 
7.531 


†Multiply by 10⁴.
*Multiply by 10².


6 


83.352* 
3.014* 

93.042 

50.646 


27.115 
19.701 
16.224 
14.239 
12.963 
12.078 


10.728 
9.969 
9.484 
9.147 


8.711 
8.441 
8.257 
8.124 


7.945 
7.607 
7.295 


TABLE B.2 (Continued) 


8 


85.093* 
3.020* 

92.102 

49.739 


26.387 
19.079 
15.666 
13.722 
12.476 
11.612 


10.297 
9.559 
9.088 
8.761 


8.339 
8.077 
7.899 
7.770 


7.597 
7271 
6.970 


10 


86.160* 
3.024* 

91.515 

49.170 


25.927 
18.683 
15.309 
13.389 
12.161 
11.310 


10.016 
9.291 
8.828 
8.507 


8.092 
7.836 
7.661 
7.535 


7.365 
7.045 
6.750 


5% Significance Level 


p=5 
12 
86.88! 
3.0271 
91.113 
48.780 


25.610 
18.409 
15.059 
13.157 
11.939 
11.097 


9.817 
9.099 
8.642 
8.325 


7.915 
7.662 
7.489 
7.365 


7.197 
6.881 
6.590 


15 


3.0291 
90.705 
48.382 


25.284 
18.124 
14.800 
12.914 
11.708 
10.874 


9.606 
8.896 
8.444 
8.130 


7.725 
7.474 
7.304 
7.181 


7.014 
6.702 
6.414 


20 


3.0321 
90.29 
47.973 


24.947 
17.830 
14.530 
12.659 
11.463 
10.637 


9.381 
8.679 
8.230 
7.919 


7.518 
7.269 
7.100 
6.978 


6.813 
6.503 
6.217 




20.495* 
15.014* 
2.735* 
1.150* 


48.048 
31.108 
24.016 
20.240 
17.929 
16.380 


14.107 
12.880 
12.115 
11.593 


10.928 
10.523 
10.251 
10.055 


9.793 
9.306 
8.863 


"Multiply by 102. 




20.834* 
15.019* 
2.704* 
1.128* 


46.670 
30.065 
23.145 
19.472 
17.228 
15.727 


13.529 
12.345 
11.607 
11.105 


10.465 
10.076 
9.814 
9.626 


9.374 
8.907 
8.482 


TABLE B.2 (Continued) 


21.267* 
15.025* 
2.665* 
1.099* 


44.877. 
28.701 
22.001 
18.459 
16.302 
14.862 


12.759 
11.629 
10.926 
10.448 


9.841 
9.471 
9.223 
9.045 


8.806 
8.363 
7.961 


10 


21.53* 
15.029* 
2.640* 
1.081* 


43.758 
27.846 
21.279 
17.817 
15.713 
14.310 


12.265 
11.167 
10.486 
10.022 


9.434 
9.076 
8.835 
8.663 


8.432 
8.004 
7.615 


1% Significance Level 


р=5 
12 


15.033* 
2.623* 
1.069* 


42.992 
27.257 
20.781 
17.373 
15.304 
13.925 


11.918 
10.842 
10.174 

9.720 


9.144 
8.794 
8.559 
8.390 


8.164 
7.745 
7.365 


15 


15.03* 
2.606* 
1.057* 


42.210 
26.653 
20.268 
16.913 
14.878 
13.525 


11.555 
10.500 
9.845 
9.401 


8.836 
8.493 
8.263 
8.097 


7.876 
7.465 
7.093 


15.06 * 
2.590* 
1.044* 


41.408 


26.031 
19.736 
16.435 
14.435 
13.105 


11.172 
10.136 
9.494 
9.058 


8.504 
8.167 
7.941 
7.779 


7.561 
7157 
6.790 


TABLE B.2 (Continued) 


12 


5% Significance Level 


p=6 
15 20 25 

43.103 42.626 42.334 
26.843 26.451 26.209 
20.489 20.144 19.929 
17.202 16.886 16.688 
` 15.218 14.921 14.735 
13.899 13.615 13.436 
11.975 11.711 11,544 
10.939 10.687 10.526 
10.293 10.049 9.892 
9.853 9.614 9.460 
9.293 9.060 8.908 
8.951 8.721 8.572 
8.720 8.494 8.345 
8.555 8.330 8.182 
8.333 8.110 17.963 
7.919 7.701 7.555 
7.689 7.473 7.328 
7.616 7.400 7.255 
7.543 7.328 7.183 


30 


42.136 
26.044 
19.783 
16.553 
14.607 
13.313 


11.428 
10.414 
9.782 
9.351 


8.801 
8.465 
8.239 
8.076 


7.857 
7.449 
7.222 
7.149 


7.077 


35 


41.993 
25.925 
19.677 
16.455 - 
14.513 
13.223 


11.343 
10.331 
9.700 
9.270 


8.721 
8.385 
8.159 
7.996 


7.777 
7.369 
7.140 
7.067 


6.994 


TABLE B.2 (Continued) 


TABLE B.2 (Continued)
5% Significance Level 


р=7 
15 20 25 30 


1% Significance Level 
p=6 
n\m 6 8 10 12 15 20 25 30 35 
10 | 86.397 83.565 81.804 80.602 79.376 78.124 77.360 76.845 76.474 
12 | 46.027 44103 42.899 42.073 41.227 40.359 39.826 39.466 39.206 
14 | 32.433 30.918 29.966 29.309 28.634 27.936 27.507 27.215 27.004 
16 | 25.977 24.689 23.875 23.311 22.729 22126 21.753 21.498 21.314 
18122292 21146 20.418 19.913 19.389 18.844 18.505 18.273 18.105 
20 | 19.935 18.886 18.217 17.752 17.267 16.761 16.445 16.229 16.071 


25 | 16.642 15.737 15.156 14.749 14.324 13.875 13.592 13.397 13.254 
30 | 14.944 14.118 13.586 13.211 12.816 12.398 12.133 11.949 11.814 
35 |13.913 13.138 12.635 12.281 11.906 11.506 11.252 11.074 10.943 
40 | 13.223 12.482 12.000 11.659 11.298 10.911 10.663 10.490 10.361 


50 | 12.358 11.661 11.206 10.882 10.538 10.167 9.927 9.759 9.633 
60 | 11.839 11.169 10.730 10.417 10.083 9.721 9.486 9.320 9.196 
70 | 11.493 10.841 10.413 10107 9.779 9.424 9.192 9.028 8.905 
80 | 11.246 10.607 10.187 9.886 9.563 9.212 8.983 8.819 8.697 


100 | 10.917 10.295 9.886 9.592 9.276 8.930 8.703 8.541 8.419 
200 | 10.312 9.723 9.333 9.052 8.748 8412 8.190 8.030 7.908 
500 | 9.980 9.409 9.030 8.755 8458 8128 7907 7.747 7.625 
1000 | 9.874 9.308 8.933 8.661 8.365 8.037 7.817 17.657 7.534 


ос 9.770 9.210 8.838 8.568 8.274 7.948 7.728 17.568 7.446 






185.93 182.94 180.90 178.83 176.3 17544 17457 173.92 


71.731 
44.255 
33.097 
27.273 
23.757 


19.117 
16.848 
15.512 
14.634 


13:553 
12.914 
12.492 
12.193 


11.797 
11.077 
10.685 
10.561 


10.439 


69.978 
42.978 
32.057 
26.374 
22.949 


18.440 
16.239 
14.945 
14.095 


13.049 
12.432 
12.024 
11.736 


11.353 
10.658 
10.230 
10.160 


10.043 


TABLE B.2 (Continued) 


68.779 
42.099 
31.339 
25.750 
22.388 


17.965 
15.810 
14.544 
13.713 


12.691 
12.088 
11.690 
11.408 


11.034 
10.356 
9.987 
9.869 


9.755 


1% Significance Level 


p = 7


15 


67.552 
41.197 
30.599 
25.105 
21.804 


17.469 
15.360 
14.121 
13.309 


12.310 
11.720 
11.332 
11.056 


10.691 
10.028 
9.668 
9.553 


9.441 


20 


66.296 
40.269 
29.834
24.435 
21.195 


16.947 
14.882 
13.670 
12.876 


11.899 
11.323 
10.942 
10.673 


10.316 
9.667 
9.314 
9.202 


9.092 


25 


65.528 
39.698 
29.361 
24.019 
20.816 


16.619 
14.580 
13.383 
12.599 


11.634 
11.065 
10.689 
10.422 


10.070 
9.427 
9.078 
8.966


8.857 


30 


65.010 
39.311 
29.039 
23.735 
20.556 


16.392 
14.370 
13.183 
12.405 


11.448 
10.882 
10.509 
10.244 


9.894 
9.254 
8.906 
8.795 


8.686 


64.636 
39.032 
28.806 
23.529 
20.367 


16.227 
14.216 
13.036 
12.262 


11.309 
10.746 
10.374 
10.110 


9.761 
9.123 
8.774 
8.663 


8.555 


TABLE B.2 (Continued) 


41.737 
31.242 
25.847 
22.605 


18.324 
16.221 
14.977 
14.156 


13.142 
12.540 
12.142 
11.858 


11.483 
10.798 
10.423 
10.304 


10.188 


12 


41.198 
30.788 
25.446 
22.239 


18.009 
15.934 
14.707 
13.898 


12.898 
12.305 
11.912 
11.634 


11.264 
10.589 
10.221 
10.104 


9.989 


5% Significance Level 


p = 8 


15 


30.318 
25.028 
21.856 


17.677 
15.629 
14.418 
13.621 


12.636 
12.051 
11.665 
11.390 


11.026 
10.362 
9.999 
9.884 


9.771 


20 


29.829 
24.591 
21.454 


17.325 
15.303 
14.109 
13.322 


12.351 
11.774 
11.393 
11.122 


10.763 
10.108 
9.751 
9.637 


9.526 


25 


29.525 
24.319 
21.201 


17.102 
15.095 
13.910 
13.129 


12.165 
11.593 
11.215 
10.946 


10.590 
9.939 
9.584 
9.470 


9.360 


30 


29.318 
24.132 
21.028 


16.947 
14.950 
13.771 
12.994 


12.034 
11.465 
11.088 
10.820 


10.465 
9.816 
9.461 
9.348 


9.238 


TABLE B.2 (Continued) 


1% Significance Level 
p = 8 


62.828 
42.707 
33.373 
28.129 


21.661 
18.686 
16.991 
15.900 


14.582 
13.815 
13.313 
12.960 


12.496 
11.660 
11.210 
11.067 


10.928 


15 


61.592 
41.754 
32.573 
27.425 


21.085 
18.173 
16.516 
15.451 


14.163 
13.414 
12.925 
12.580 


12.127 
11.311 
10.871 
10.732 


10.597 


20 


60.323 
40.771 
31.745 
26.691 


20.480 
17.631 
16.011 
14.970 


13.711 
12.980 
12.502 
12.165 


11.722 
10.925 
10.495 
10.359 


10.227 


25 


59.545 
40.164 
31.232 
26.235 


20.100 
17.288 
15.690 
14.662 


13.420 
12.698 
12.226 
11.894 


11.457 


10.669 
10.244 
10.109 
59.019 
39.753 
30.882 
25.924 


19.838 
17.051 
15.466 
14.447 


13.216 
12.499 
12.031 
11.701 


11.267 
10.484 
10.061 

9.927 


9.796 


35 


58.639 
39.456 
30.629 
25.697 


19.647 
16.876 
15.301 
14.288 


13.063 
12.351 
11.885 
11.556 


11.124 
10.343 
9.921 
9.787 


9.656 
TABLE B.3 (Continued) 


p=8 


a 
.835 10.423 10.141 9.730 9.766 9.632 9.521 9.426 9.100 8.904 

-941 10.50? 10.214 9.995 9.824 9.685 9.570 9.472 9.134 8.935 

-10? 10.647 10.333 10.101 9.920 9.774 9.652 9.550 9.198 8.988 

-235 10.753 10. 425 10.184 9.996 9.844 9.718 9.517 9.249 9. 032 

-334 10.837 10.499 10.250 10.057 9.902 9.772 9.663 9.292 9.070 

-448 10.934 10.585 10.329 10.130 9.970 9.837 9.725 9.344 9.116 

- 580 11.048 10.688 10.423 10.218 10.053 9.917 9.801 9.409 9.176 

-734 11.183 10.811 10.538 10.326 10.154 10.016 9.897 9.494 9.254 

-822 11.261 10.883 10.605 10.390 10.218 10.075 9. 9.547 9.304 

-918 11.347 10.962 10.680 10.442 10.288 10.143 10.021 $.609 9.363 

-023 11.442 11. 051 10.765 10.544 10.367 10.221 10.098 9.482 9.435 

‚ 138 11.549 11.152 10. 82 10.638 10. 459 10.312 10.188 9.771 9. 526 

19 11.463 11.040 10.758 10.521 10.328 10.167 10.030 9.558 9.275 
21 11.598 11.174 10.857 10.610 10.409 10.241 10.099 9.512 9.321 
25 11.819 11.340 11.021 10.757 10.543 10.346 10.216 9.704 9.399 
29 11.991 11.507 11.151 10.874 10.651 10.447 10.310 9.780 9.465 
33 12.128 11.626 11.256 10.970 10.740 10.550 10.389 9.844 9.521 
.01 39 12.290 11.766 11.383 11.086 10.848 10.651 10.485 9.924 9.591 
49 18.058 14.702 13.297 12.482 11.935 11.536 11.227 10.980 10.776 10.604 10.024 9.681 
69 18.640 15.058 13.573 12.716 12.143 11.725 11.403 11.144 10.934 10.756 10.155 9.801 
89 18.962 15.261 13.733 12.853 12.265 11.838 11.509 11.247 11.030 10.849 10.238 9.877 
129 19.310 15.484 13.909 13.005 12.403 11.965 11.630 11.342 11.142 10.956 10.335 9.970 
249 19.684 15.729 14.106 13.177 12.559 12.112 11.769 11.495 11.271 11.083 10.453 10.083 
∞ 20.090 16.000 14.327 13.371 12.738 12.280 11.930 11.652 11.424 11.233 10.597 10.227 


13.027 12.603 12.311 12.093 11.922 11.782 11. 664 11.566 11.222 H.013 
13. 147 12.700 12.393 12. 164 11. 985 11.840 11.719 11.616 11.260 11.045 
13.340 12.857 12.526 12.282 12.091 11.937 11.808 11.699 11.325 11.100 
13.487 12.978 12.631 12. 375 12. 176 12.015 11.881 11. 768 11.380 11. 146 
13.603 13.075 12.716 12. 451 12.245 12. 079 11.941 11.824 11.426 Y. 186 
13.737 13. 168 12.816 12. 541 12. 328 12.156 12.014 11.893 11.482 11.236 
13. 895 13.323 12.936 12.651 12.430 12. 251 12. 104 11.979 11.555 11.301 
14.083 13. 486 13.083 12.786 12. 556 12.371 12.218 12.089 11.450 11.388 
14.191 13. 581 13. 169 12.846 12.632 12.443 12.288 12.156 11.710 11. 443 
14.310 13. 687 13.266 12.957 12.718 12. 525 12.368 12.234 11.781 11.511 
14.443 13.806 13. 376 13.061 12.818 12. 622 12. 441 12. 325 11.847 Vi. 594 
14.591 13.940 13. 501 13. 180 12.933 12.735 12. 572 12. 434 11.972 11.700 
14.284 13.698 13.288 12. 980 12.736 12. 537 12.371 12. 228 11.733 11. 432 
14. 476 13. 849 13. 413 13. 088 12. 832 12. 624 12. 449 12. 301 11.789 11. 478 
14.788 14. 096 13. 621 13.268 12.993 12.769 12. 584 12. 426 11.885 11.559 
15.029 14. 290 13. 784 13. 413 13. 123 12.888 12.693 12.528 11.965 11.628 
15.222 14.447 13. 920 13. 531 13.230 12. 984 12.785 12.614 12.034 11. 687 
15. 448 14. 632 14. 080 13. 674 13. 359 13. 104 12. 897 12.720 12.119 11.761 
15.718 14. 855 „848 13. 519 13. 255 13. 037 12. 852 12. 229 11.858 
. . . 067 13.722 13. 445 13. 216 13.024 12.374 11.989 
„199 13.845 13. 561 13.327 13. 130 12. 466 12.074 

.350 13.984 13.695 13. 454 13. 254 12.577 12.177 

„525 14.152 13. 853 13. 608 13. 402 12.712 12.307 

. 730 14.346 14.041 13.791 13. 581 12.881 12.472 
TABLE B.4 
TABLES OF SIGNIFICANCE POINTS FOR THE ROY MAXIMUM ROOT TEST 


Pr "Re x} =a 
m 


. 4. 675 . А . Д . 
7.710 4.834 3.742 3.154 2.782 2.523 2.333 2.186 2.069 1.973 


1. 1. 
188 up 
8.007 5.054 3.937 3.325 2.936 2.664 2.463 2.307 2.182 2.080 1.758 1. 
ROM E23 204 348 05 2 2.5589 2.397 2.268 2.161 1.823 1.841 
8.349 53% 4.176 3.540 3.133 2.847 2.634 2.468 2.335 2,225 1.875 1.686 
. .465 4.287 3.642 3.208 2.936 2.718 2.548 2.412 2.299 1.938 1.740 
4 Bes) R40 i4» 375 Sse 20 200 аваар 229 LOS 1240 
8.831 5747 4.543 3.881 3.454 3.153 2.926 2.749 2.407 2.488 2.105 1.891 
8.920 5.825 4414 3.950 3.520 3217 2.989 2.810 2.646 2.547 2.160 1.94 
9.012 5.905 4.692 4.022 3.591 3285 3.056 2.877 2.732 2.612 2.22 2.00 
4.772 4.100 3.666 3.360 3.130 2.950 2.804 2.683 2.292 2.072 
$20 £09 Lae LU зв 1029 2.884 2.743 2.373 2.154 
р=3 2 
UON T 7 00700 0M - 
a “т 1 2 3 4 5 6 7 8 ою в 2 
NI 
4 .989 4.517 3.544 3.010 2.669 2.430 2.254 2.117 2.008 1.919 1.639 1.491 
6| $098 LH) 344 зоо 20 2.501 2.319 2.178 2.045 1.973 1.682 1.526 
20 7.28 4.760 3.757 3.215 28% 2.608 2420 2.274 2.158 2.059 1.75] 1.585 
24 7.341 4.858 3.859 3.302 2.942 2.686 2.495 2.346 2.224 2.124 1.805 1.831 
28 7.410 49% 3.927 3367 3.004 2.746 2.552 2.400 2.277 2.176 1.849 1.68 
4 7. 5.004 4.001 349 3.073 2.812 2416 2.462 2.338 2.234 1.901 1.715 
05 n 759 $0M 400 217 108 280 2.689 2.534 2.408 2.303 1.962 1.771 
64 7.439 513 4169 3404 3.235 2.972 2773 2.616 2.487 2.383 2038 1.842 
B4 7.481 520 4.214 3651 3.282 3019 2.820 2.643 2.535 2.409 2.082 1.885 
124 7.74 5.268 4.245 3.701 3.332 3.009 2.8/0 2.713 2.586 2.479 2.12 1.934 
5.318 4.317 3.754 3.386 3.123 2.924 2.768 2.641 2.535 2.189 1.991 
24 yas £30 437 80 зз 318 2.983 2.828 2.701 2.594 2.253 2.09 
14 8.91 5.416 4.106 2. . . 295 2. . 1724 1. 
они: 
A $500 $11 468) 3.910 3.422 3.082 2.83] 2.636 2.481 2.354 1.954 1.740 
28 | 10.106 6.264 481 4.026 3.528 3.180 2.922 2.72 72.42 243 2.006 1.792 
. 431 4.955 4.156 3.647 3.222 3.027 2.821 2.457 2.521 2.091 1.857 
9| L| ie Sate LI те з Yd 3.148 2.938 2.768 2.628 2.182 1.937 
44 | 10790 6.815 5.296 4.449 3.940 3.568 3.291 3.075 2.901 2.757 2.295 2.040 
10.920 6.923 5.393 4.560 4.027 3.452 3372 3.153 2.977 2.832 2. M3 2.103 
124 | 11.056 7.037 5497 £658 4120 3.742 3440 329 3.060 2915 244.217 
1.196 7.157 5.603 4.763 4.20) 3.841 3.556 3.334 3.155 3.008 2.530 2.244 
"| on Fee $03 ала C4 308 3.663 3.440 3.260 3112 2.834 2.369 
TABLE B.4 (Continued) 


. 6. - 4.005 3.468 3.098 2.827 2.620 2.455 2.322 1.908 1.693 
. ` 6.634 5. 4.128 3.579 3.199 2.920 2.705 2.535 2.396 1.965 1.740 
11.282 6.860 5.216 4.320 3.753 3.359 3.068 2.844 2.665 2.519 2.062 1.819 
11.483 7.056 5.372 4.462 3.884 3.481 3.182 2.951 2.767 2.615 2.139 1.884 
11.630 7.168 5.491 4.571 3.985 3.576 3.272 3.036 2.848 2.693 2.203 1.939 
11.790 7.334 5.625 4.695 4.101 3.686 3.376 3.136 2.943 2.785 2.280 2.006 
11.964 7.475 5.774 4.836 4.235 3.813 3.498 3.253 3.057 2.894 2.375 2.090 
12.154 7.675 5.944 4.997 4.390 3.963 3.643 3.394 3.194 3.028 2.495 2.200 
12.255 7.774 6.038 5.088 4.477 4.048 3.727 3.476 3.274 3.107 2.568 2.268 
12.362 7.878 6.138 5.185 4.573 4.141 3.818 3.566 3.363 3.195 2.652 2.348 
6.246 5.291 4.676 4.244 3.920 3.667 3.463 3.294 2.749 2.444 
6.382 5 А 
. 5.340 4 . . 
13. 126 7.570 5.574 4 3. 885 3.444 3.122 2.877 2.683 2.526 2.044 1.795 
13.736 7.992 5.91 4.135 3. 667 3.325 3.063 2. 2.687 2.165 1.892 
14.173 8.303 6.16 4.328 3.84! 3.484 3.210 2.993 2.816 2.264 1.974 
14.501 8.541 6.36 4.480 3.980 3.612 3.329 3.104 2.921 2.347 2.043 
14.865 8.808 6.583 5.400 4.657 4.142 3.763 3.470 3.237 3.047 2.448 2.128 
15.270 9.112 6.839 5.628 4.864 4.334 3.943 3.640 3.399 3.201 2.576 2.238 
15.723 9.458 7.136 5.895 5.111 4.565 4.161 3.848 3.598 3.393 2.739 2.383 
15.970 9.650 7.303 6.047 5.252 4.699 4.289 3.971 3.716 3.507 2.840 2.475 
16.233 9.856 7.484 6.213 5.408 4.847 4.431 4.108 3.850 3.637 2.958 2.584 
16.513 10.079 7.682 6.395 5.580 5.012 4.591 4.264 4.002 3.786 3.097 2.717 
16.812 10.319 7. 
11.961 7.063 5.258 4.304 3.708 3.298 2.999 2.770 2.589 2.442 1.989 1.753 
12.184 7.243 5.411 4.437 3.827 3.406 3.098 2.861 2.674 2.522 2.049 1.802 
12.513 7.516 5.647 4.647 4.016 3.580 3.258 3.011 2.814 2.653 2.151 1.887 
12.744 7.714 5.821 4.803 4.160 3.713 3.382 3.127 2.924 2.757 2.235 1.956 
12.915 7.863 5.954 4.925 4.272 3.817 3.481 3.220 3.012 2.842 2.304 2.015 
13.102 8.030 6.104 5.063 4.401 3.939 3.596 3.330 3.117 2.942 2.388 2.088 
13.308 8.216 6.275 5.222 4.551 4.081 3.732 3.461 3.243 3.063 2.492 2.180 
13.534 8.426 6.471 5.407 4.727 4.251 3.895 3.619 3.396 3.213 2.625 2.300 
13.657 8.541 6.579 5.511 4.827 4.348 3.990 3.711 3.486 3.301 2.706 2.376 
13.786 8.665 6.697 5.624 4.937 4.455 4.095 3.814 3.588 3.401 2.800 2.465 

6.824 5.747 5.058 4.573 4.211 3.929 3.702 3.514 2.910 2.573 

6.961 5.882 5.191 4.705 4.343 4.060 3.833 3.645 3.041 2.705 


. . Я 4.640 3.960 3.498 3.163 2.908 2.707 2. 545 
14.310 8.144 5.971 4.829 4.124 3.642 3.292 3.026 2.816 2.646 
14.974 8.619 6.332 5.133 4.389 3.879 3.507 3.222 2.997 2.814 
15.456 8.957 6.605 5.357 4.595 4. А 

15.822 9.2168 6.818 5.551 4.759 


4 
4, 
16.230 9.514 7.063 5.745 4.952 4 
16.688 9.852 7.346 6.015 5.179 4. . . А 
17.206 10. 243 7. 679 6.313 5. 452 15 4.413 4.072 3.799 3.575 
$ 
5 
5 


BBS 


=2 
AS 


3 


17.491 10.461 7.867 6.483 5.610 
17.796 10.697 8.073 6.670 5.785 
18.124 10.754 8.298 6.878 5.980 

. А 7.108 6.200 
TABLE B.4 (Continued) 


p-8 
а |n NY 1 2 3 4 5 6 7 8 9 10 15 20 
— ot 
19 13.101 7.640 5.645 4.594 3.941 3.493 3.166 2.916 2.719 2.559 2.057 1.812 
21 13.346 7.834 5.808 4.737 4.067 3.607 3.270 3.012 2.808 2.643 2.130 1.863 
25 13.710 8.132 6.063 4.962 4.270 3.792 3.441 3.171 2.956 2.782 2.238 1.952 
29 13.970 8.350 6.253 5.131 4.425 3.935 3.574 3.295 3.074 2.893 2.326 2.005 
33 14.163 8.515 6.399 5.264 4.547 4.049 3.680 3.396 3.169 2.984 2.400 2.088 
.05 39 14.377 8.701 6.566 5.416 4.688 4.181 3.806 3.515 3.283 3.092 2.490 2.165 
49 14.614 8.912 6.757 5.593 4.854 4.338 3.955 3.658 3.420 3.224 2.403 2.264 
69 14.877 9.151 6.977 5.800 5.050 4.526 4.136 3.832 3.589 3.388 2.747 2.395 
89 15.021 9.283 7.101 5.917 5.163 4.634 4.241 3.935 3.489 3.486 2.836 2.478 
129 15.173 9.426 7.235 6.046 5.287 4.755 4.358 4.050 3.802 3.597 2.940 2.576 
249 15.335 9.579 7.381 6.187 5.424 4.689 4.49} 4.180 3.931 3.725 3.063 2.695 
∞ 15.507 9.745 7.541 6.342 5.577 5.040 4.640 4.329 4.078 3.872 3.210 2.843 

19 14.999 8.435 6.119 4.921 4.185 3.685 3.323 3.048 2.832 2.658 2.125 1.853 
21 15.463 8.743 6.357 5.119 4.355 3.836 3.458 3.171 2.944 2.762 2.201 1.913 
25 16.177 9.226 6.739 5.439 4.634 4.084 3.682 3.376 3.134 2.938 2.332 2.018 
29 16.700 9.589 7.030 5.667 4.853 4.280 3.861 3.541 3.287 3.081 2.442 2.107 
33 17.100 9.871 7.259 5.885 5.028 4.439 4.007 3.676 3.414 3.200 2.535 2.184 
.01 39 17.549 10.194 7.524 6.115 5.234 4.627 4.181 3.839 3.566 3.344 2.649 2.280 
49 18.058 10.565 7.833 6.387 5.480 4.854 4.392 4.037 3.754 3.523 2.795 2.405 
69 18.640 10.998 8.199 6.713 5.778 5.131 4.653 4.284 3.990 3.749 2.984 2.573 
89 18.962 11.242 8.408 6.901 5.952 5.294 4.808 4.432 4.132 3.886 3.105 2.680 
129 19.310 11.508 8.638 7.109 6.146 5.478 4.983 4.601 4.295 4.044 3.246 2.810 
249 19.684 11.798 8.891 7.341 6.364 5.685 5.183 4.794 4.483 4.228 3.415 2.970 
∞ 20.090 12.117 9.173 7.601 6.610 5.922 5.413 5.019 4.703 4.445 3.622 3.171 


a [а 1 2 3 4 8$ 6 7 8 93 в X 
L 


21 15.322 8.761 6.395 5.158 4.392 3.869 3.489 3.199 2.970 2.785 2.217 1.925 
23 15.604 8.979 6.577 5.315 4.531 3.994 3.602 3.303 3.047 2.875 2.285 1.980 
27 16.033 9.320 6.844 5.566 4.756 4.199 3.790 3.477 3.229 3.028 2.403 2.075 
31 16.344 9.573 7.082 5.759 4.931 4.359 3.939 3.616 3.340 3.151 2.500 2.156 
35 16.580 9.769 7.252 5.912 5.071 4.488 4. 3.730 3.467 3.253 2.581 2.225 
.05 41 16.843 9.992 7.448 6.090 5.234 4.541 4.203 3.865 3.596 3.376 2.682 2.311 
51 17.140 10.247 7.676 6.299 5.429 4.824 4.376 4.030 3.754 2.810 2.423 
71 17.476 10.543 7.944 6.548 5.663 5.047 4.590 4.235 3.952 3.719 2.977 2.572 
91 17.662 10.709 8.097 6.691 5.800 5.178 4.716 4.358 4.070 3.834 3.081 2.448 
131 17.861 10.890 8.265 6.850 5.952 5.324 4.858 4.496 4. 3.767 3.204 2.783 
251 18.076 11.087 8.450 7.027 6.122 5.490 5.021 4.656 4.363 4.122 3.350 2.925 
∞ 18.307 11.303 8.654 7.224 6.315 5.67? 5.207 4.840 4.546 4.304 3.529 3.102 

21 17.197 9.534 6.851 5.470 4.624 4.051 3. 3.322 3.075 2.877 2.271 1.962 
23 17.707 9.867 7.107 5.682 4.806 4.211 3.779 3.452 3.194 2.987 2.351 2.025 
27 18.505 10.399 7.523 6.029 5.107 4.478 4.021 3.672 3.397 3.175 2.471 2.137 
31 19.101 10.805 7.846 6.302 5.346 4.693 4.216 3.851 3.564 3.330 2.608 2.233 
35 19.562 11.125 8.103 6.522 5.541 4.868 4.376 4.000 3.702 3.460 2.709 2.315 
.01 41 20.088 11.495 8.405 6.782 5.772 5.078 4.570 4.180 3.871 3.619 2.835 2.420 
51 20.692 11.928 8.761 7.093 6.052 5.335 4.808 4.403 4.081 3.819 2.996 2.558 
71 21.394 12.441 9.190 7.473 6.397 5.654 5.107 4.686 4.350 4.076 3.211 2.745 
91 21.790 12.735 9.439 7.695 6.601 5.845 5.287 4.857 4.515 4.234 3.347 2.867 
131 22.221 13.05? 9.716 7.944 6.832 6.062 5.494 5.055 4.705 4.418 3.510 3.016 
251 22.692 13.417 10.025 8.225 7.094 6.310 5.732 5.285 4. 4.636 3.707 3.201 

А 13.816 10.373 8. 545 7.395 6.598 6.010 5.556 5.193 4.895 3.952 3.438 
TABLE B.5 


SIGNIFICANCE POINTS FOR THE MODIFIED LIKELIHOOD RATIO TEST OF 
EQUALITY OF COVARIANCE MATRICES BASED ON EQUAL SAMPLE SIZES 
Pr{−2 log λ* ≥ x} = 0.05 


4 


5 6 


7 8 9 


24.55 
22.00 
20.73 


19.97 
19.46 
19.10 
18.83 
18.61 


41.0 


38.06 
36.29 
35.10 
34.24 
33.59 


33.08 
32.67 
32.33 


65.91 
60.90 
57.17 
55.62 
54.05 


52.85 
51.90 
51.13 
50.50 
49.97 


p = 2 
30.09 35.45 
27.07 31.97 
25.56 30.23 


24.66 29.19 
24.05 28.49 
23.62 27.99 
23.30 27.62 
23.05 27.33 


p = 3 
51.0 60.7 
47.49 56.68 
45.37 54.21 
43.93 52.54 
42.90 51.34 
42.11 50.42 


41.50 49.71 
41.01 49.13 
40.60 48.66 


p = 4 
82.6 98.9 
76.56 91.89 
72.78 87.46 
70.17 84.42 
68.27 82.19 


66.81 80.49 
65.66 79.14 
64.73 78.04 
63.96 77.14 
63.31 76.38 


70.3 79.7 89.0 


65.69 74.58 83.37 
62.89 71.45 79.91 
60.99 69.33 77.56 
59.62 67.79 75.86 
58.58 66.62 74.57 


57.76 65.71 73.56 
57.11 64.97 72.15 
56.57 64.37 72.08 


115.0 131.0 — 
107.0 121.9 137.0 
101.9 116.2 130.4 
98.45 112.3 126.1 
95.91 109.5 122.9 


93.95 107.3 120.5 
92.41 105.5 118.5 
91.16 104.1 117.0 
90.12 103.0 115.7 
89.25 102.0 114.6 
TABLE B.5 (Continued) 
ng 2 3 4 5 6 7 |n,Nq 2 3 4 5 
p= p=6 
8 | 39.29 65.15 89.46 113.0 — — | 10 | 49.95 84.43 117.0 — 
9 | 36.70 61.40 84.63 107.2 129.3 151.5 
10 | 34.92 58.79 81.25 103.1 124.5 145.7 | 11 | 47.43 80.69 112.2 142.9 
| 12 | 45.56 77.90 108.6 138.4 
11 | 33.62 56.86 78.76 100.0 120.9 141.6 | 13 | 44.11 75.74 105.7 135.0 
12 | 32.62 55.37 76.83 97.68 118.2 138.4 | 14 | 42.96 74.01 103.5 132.2 
13 | 31.83 54.19 75.30 95.81 116.0 135.9 | 15 | 42.03 72.59 101.6 129.9 
14 | 31.19 53.24 74.06 94.29 114.2 133.8 
15 | 30.66 52.44 73.02 93.03 112.7 132.1 | 16 | 41.25 71.41 100.1 128.0 
| 17 | 40.59 70.41 93.75 126.4 
16 | 30.21 51.77 72.14 91.95 111.4 130.6 | 18 | 40.00 69.55 97.63 125.0 
| 19 | 39.53 68.80 96.64 123.8 
| 20 | 39.11 68.14 95.78 122.7 
1.000 
1.000 
1.000 
1.000 


11.0705 


1.001 
1.000 
1.000 


16.9190 


TABLE B.6 
CORRECTION FACTORS FOR SIGNIFICANCE POINTS FOR THE SPHERICITY TEST 


5% Significance Level 


1.008 
1.006 


1.004 
1.003 


1.002 
1.001 
1.001 
1.000 


23.6848 


1.420 
1.180 
1.098 
1.071 


1.039 
1.024 
1.017 
1.012 
1.010 


1.006 
1.004 


1.003 
1.002 
1.001 
1.000 


31.4104 


1.442 
1.199 
1.121 


1.060 
1.037 
1.025 
1.018 
1.014 


1.009 
1.006 


1.004 
1.002 
1.002 
1.000 


40.1133 


1.455 
1.214 


1.093 
1.054 


1.035 


1.025 
1.019 


1.012 
1.008 


1.005 
1.003 
1.002 
1.000 


49.8018 
TABLE B.6 (Continued) 
1% Significance Level 


TABLE B.7* 
SIGNIFICANCE POINTS FOR THE MODIFIED LIKELIHOOD RATIO TEST Σ = Σ₀ 
Pr{−2 log λ*₁ ≥ x} = 0.05 


4|. n 5% 59 19 |n 5 1% |n 5 1% 
5! 109 1.396 i nea 
= = =5 - 6 
6 | 1046 1.148 1.471 2 1350 188 256 |9 325 400 |12 409 49.0 
7, 108 1079 1186 (151 3 10.64 1682 2268 10 314 386 |13 400 478 
8| 1019 1.049 1.103 1.213 1.542 4 969 14 393 470 
9 | 1013 1.034 1067 1123 1234 1.556 5 922 1581 2123 |11 3055 375 |15 387 462 
10 | 1010 1.025 1.047 1.081 1138 1250 1519 2036 |12 2992 3672 
12] 106 1.015 1.027 1044 1068 1104 6 894 1477 1978|13 2942 36.09 |16 3822 45.65 
14 | 1004 1.010 1.018 1.028 1041 100 7 815 1447 1936|14 2902 3557|17 3181 45.13 
16| 100 1.007 1012 1019 1028 103 8 862 1424 1904|15 2868 35.15 |18 3745 44.70 
18 | 1002 1.005 1.009 1.014 1.020 1.028 9 8.52 19 3714 443 
20 | 1.002 1.004 1.007 1011 1015 101 10 8.44 1406 18.80 |16 2840 34.79 |20 3687 43.99 
4l Loot 109 1.008 оу imo len 13.92 1861|17 2815 3449|21 36.63 43.69 
28 | 100 — 102 — 1008 105 107 109 p=4 1380 1845 18 2794 3423 
7 258 13.70 18.31 |19 2776 3400|22 3641 43.43 
34 | 1001 1.001 1.002 1.003 1.004 100 8 2406 13.62 1820|20 2760 3379 |24 36.05 42.99 
42 | 1000 1.001 1.001 1.002 1.03 1.003 9 23.00 26 35.15 42.63 
50 | 1.000 1.001 1.001 1.001 1.002 1.002 10 2228 28 3549 4232 
100 1.000 1.000 1.000 1.000 1.000 1.001 30 3528 42.07 
χ² | 15.0863 21.6660 29.1412 37.5662 46.9629 57.3421 11 21.75 
12 21.35 
13 21.03 
14 20.77 
15 20.56 
р= 1 р = р= 9 р= 10 
18 486 584 671 |28 701 796 |34 (82.3) (924) 
19 482 577 663 130 694 788 |36 817 918 
20 477 57.09 65.68 38 812 912 
21 4734 5661 6512|32 688 7817 |40 807 907 
22 4700 34 6834 77.60 
5620 64.64 | 36 (67.91) (7108) 45 79.83 89.63 
24 4643 55.84 6423 | 38 (67.53) (76.65)| 50 79.13 88.83 
26 45.97 55.54 63.87 40 6721 7629|55 78.57 88.20 
28 45.58 5526 63.55 60 7813 87.68 
30 4525 55.03 63.28 | 45 66.54 75.31 65 77.75 87.26 
32 44.97 50 6602 7492 
34 4473 55 6561 74.44 |70 7744 86.89 


[80 65.28 74.06 | 75 77.18 86.59 


*Entries in parentheses have been interpolated or extrapolated into Korin's table. 
p = number of variates; N = number of observations; n = N − 1. −2 log λ*₁ = n log|Σ₀| − 
np − n log|S| + n tr(SΣ₀⁻¹), where S is the sample covariance matrix. 
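The statistic defined in the footnote can be evaluated directly from S, Σ₀, and n before the table is consulted. The following is a minimal sketch in plain Python, not the book's own procedure: the function names (`cholesky`, `log_det`, `trace_s_sigma0_inv`, `modified_lr_statistic`) are hypothetical, both matrices are assumed to be symmetric positive definite, and the use of a Cholesky factorization is simply one convenient way to obtain the log-determinants and the trace term.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def log_det(a):
    """log|A| computed as twice the sum of the logs of the Cholesky diagonal."""
    low = cholesky(a)
    return 2.0 * sum(math.log(low[i][i]) for i in range(len(a)))


def trace_s_sigma0_inv(s, sigma0):
    """tr(S Sigma0^{-1}), obtained by solving Sigma0 x = (column j of S) for each j."""
    p = len(s)
    low = cholesky(sigma0)
    total = 0.0
    for j in range(p):
        # Forward substitution: L y = S[:, j]
        y = [0.0] * p
        for i in range(p):
            y[i] = (s[i][j] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
        # Back substitution: L^T x = y
        x = [0.0] * p
        for i in reversed(range(p)):
            x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, p))) / low[i][i]
        total += x[j]  # (j, j) entry of Sigma0^{-1} S
    return total


def modified_lr_statistic(s, sigma0, n):
    """n log|Sigma0| - n p - n log|S| + n tr(S Sigma0^{-1}), with n = N - 1."""
    p = len(s)
    return n * (log_det(sigma0) - p - log_det(s) + trace_s_sigma0_inv(s, sigma0))


# Illustrative (made-up) inputs: p = 2 variates, N = 11 observations, so n = 10.
sigma0 = [[1.0, 0.0], [0.0, 1.0]]
s = [[2.0, 0.0], [0.0, 0.5]]
print(round(modified_lr_statistic(s, sigma0, n=10), 6))  # approximately 5.0
```

The value obtained this way would then be compared with the tabulated significance point for the matching p and n; when S equals Σ₀ the statistic is zero, which gives a quick sanity check on any implementation.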
cumulants of, 54 
kurtosis of, 54 
likelihood ratio criterion for equality of 
covariance matrices, asymptotic 
distribution of, 451 
likelihood ratio criterion for independence 
of sets, asymptotic distribution of, 406 
likelihood ratio criterion for linear 
hypotheses, asymptotic distribution of, 371 
maximum likelihood estimator of 
parameters, 104 
multiple correlation coefficient, asymptotic 
distribution of, 159 
rectangular coordinates, asymptotic 
distribution of, 283 
Markov chain, 610 
Markov properties, 597 
globally, 600 
locally, 598 
pairwise, 597 
moral graph, 608 
nodes, 595 
parent, 605 
partial ordering, 605 
path, 600 
recursive factorization, 609 
Separate, 600 
vertices, 595 
well-numbered, 607 


Haar invariant distribution of orthogonal 
matrices, 162, 541 
conditional distribution, 542 
Hadamard's inequality, 61 
Head lengths and breadths of brothers, 109 
Hotelling's T², see T²-test and statistic 
Hypergeometric function, 126 
Incomplete beta function, 329 
Independence, 10 
mutual, 11 
of normal variables, 26 
of sample mean vector and sample 
covariance matrix, 77 
tests of, see Correlation coefficient; Multiple 
correlation coefficient; Testing 
independence of sets of variates 
Information matrix, 85 
Integral of a symmetric unimodal function over 
a symmetric convex set, 365 
Intraclass correlation, 484 
Invariance, see Classification into normal 
populations; Correlation coefficient; 
Generalized variance; Linear hypothesis; 
Multiple correlation coefficient; Partial 
correlation coefficient; T²-test; Testing 
that a covariance matrix is a given matrix; 
Testing that a covariance matrix is 
proportional to a given matrix; Testing 
equality of covariance matrices; Testing 
equality of covariance matrices and mean 
vectors; Testing independence of sets of 
variates 
Inverted Wishart distribution, 272 
Iris, four measurements on, 110, 180 


Jacobian, 13 

James-Stein estimator, 91 
for arbitrary known covariance matrix, 97 
average mean squared error of, 95 


Kronecker delta, 75 

Kronecker product of matrices, 643 
characteristic roots of, 643 
determinant of, 643 

Kurtosis, 54 
estimation of, 103 


Latin square, 377 

Lawley-Hotelling trace criterion, see Linear 
hypothesis 

Least squares estimator, 295 

Likelihood, induced, 71 

Likelihood function for sample from 
multivariate normal distribution, 67 

Likelihood loss function for covariance matrix, 
276 

Likelihood ratio test, definition of, 129. See 
also Correlation coefficient; Linear 
hypothesis; Mean vector; Multiple 
correlation coefficient; Regression 
coefficients; T²-test; Testing that a 
covariance matrix is a given matrix; Testing 
that a covariance matrix is proportional 
to a given matrix; Testing that a covariance 
matrix and mean vector are equal to a 
given matrix and vector; Testing equality 
of covariance matrices; Testing equality 
of covariance matrices and mean vectors; 
Testing independence of sets of variates 


Linear combinations of normal variables, 
distribution of, 29 


Linear equations, solution of, 606 


by Gaussian elimination, 607 


Linear functional relationship, 513 


relation to simultaneous equations, 520 


Linear hypothesis, testing of 


admissibility of, 353 
necessary condition for, 363 
Bartlett-Nanda-Pillai trace criterion, 331 
admissibility of, 379 
asymptotic expansion of distribution 
of, 333 
as Bayes procedure, 378 
table of significance points of, 673 
tabulation of power of, 333 
canonical form of, 303 
comparison of powers, 334 
invariance of criteria, 327 
Lawley-Hotelling trace criterion, 328 
admissibility of, 379 
asymptotic expansion of distribution 
of, 330 
monotonicity of power function of, 368 
table of significance points of, 657 
tabulation of, 328 
likelihood ratio criterion, 300 
admissibility of, 378 
asymptotic expansion of distribution 
of, 321 
as Bayes procedure, 378 
distributions of, 306, 310 
F-approximation to distribution of, 326 
geometric interpretation of, 302 
moments of, 309 
monotonicity of power function of, 368 
normal approximation to distribution 
of, 323 
table of significance points, 651 
tabulation of distribution of, 314 
Wilks' Λ, 300 
monotonicity of power function of an 
invariant test, 363 
Roy’s maximum root criterion, 333 
distribution for p = 2, 334 
monotonicity of power function of, 368 
Linear hypothesis (Continued) 
table of significance points, 677 
tabulation of distribution of, 333 
step-down test, 314 
See also Regression coefficients and function 
Linearly independent vectors, 627 
Linear transformation of a normal vector, 23, 
29, 31 
Loss, 88 
LR decomposition, 630 


Mahalanobis distance, 80, 217 
sample, 228 
Majorization, 355 
weak, 355 
Marginal density, 9 
distribution, 9 
normal, 27 
Mathematical expectation, 9 
Matrix, 624 
bidiagonal, upper, 503 
characteristic roots and vectors of, see 
Characteristic roots and vectors 
cofactor in, 627 
convexity, 358 
definition of, 624 
diagonalization of symmetric, 631 
doubly stochastic, 646 
eigenvalue, see Characteristic roots and 
vectors 
Givens, 471, 649 
Householder, 470, 650 
idempotent, 635 
identity, 626 
inverse, 627 
minor of, 627 
nonsingular, 627 
operations with, 625 
positive definite, 628 
positive semidefinite, 628 
rank of, 628 
symmetric, 626 
trace of, 629 
transpose, 625 
triangular, 629 
tridiagonal, 470 
Matrix of sums of squares and cross-products 
of deviations from the means, 68 
Maximum likelihood estimators, see Canonical 
correlations and variates; Correlation 
coefficient; Covariance matrix; Mean 
vector; Multiple correlation coefficient; 
Partial correlation coefficient; Principal 
components; Regression coefficients; 
Variance 
Maximum likelihood estimator of function of 
parameters, 71 
Maximum of the likelihood function, 70 
Maximum of variance of linear combinations, 
464. See also Principal components 
Mean vector, 17 
asymptotic normality of sample, 86 
completeness of sample as an estimator of 
population, 84 
confidence region for difference of two 
when common covariance matrix is 
known, 80 
when covariance matrix is unknown, 
180 
consistency of sample as estimate of 
population, 86 
distribution of sample, 76 
efficiency of sample, 85 
improved estimator when covariance matrix 
is unknown, 185 
maximum likelihood estimator of, 70 
sample, 67 
simultaneous confidence regions for linear 
functions of, 178 
testing equality of, in several distributions, 
206 
testing equality of two when common 
covariance matrix is known, 80 
tests of hypothesis about 
when covariance matrix is known, 80 
when covariance matrix is unknown, see 
T²-test 
See also James-Stein estimator 
Minimax, 90 
Missing observations, maximum likelihood 
estimators, 168 
Modulus, 13 
Moments, 9, 41 
factoring of, 11 
from marginal distributions, 10 
of normal distributions, 46 
Monotone region, 355 
in majorization, 355 
Multiple correlation coefficient 
adjusted, 153 
distribution of sample 
conditional, 154 
when population correlation is not 
zero, 156 
when population correlation is zero, 
150 
geometric interpretation of sample, 148 
invariance of population, 60 
invariance of sample, 166 
likelihood ratio test that it is zero, 151 
as maximum correlation between one 
variable and linear combination of other 
variables, 38 
maximum likelihood estimator of, 147 
moments of sample, 156 
optimal properties of, 157 
population, 38 
sample, 145 
tabulation of distribution of, 177 
Multivariate analysis of variance (MANOVA), 
346 
Latin square, 377 
one-way, 342 
two-way, 346 
See also Linear hypothesis, testing of 
Multivariate beta distribution, 377 
Multivariate gamma function, 257 
Multivariate normal density, 20 
distribution, 20 
computation of, 23 
Multivariate t-distribution, 276, 289 


n(x | μ, Σ), 20 
N(μ, Σ), 20 
Neyman-Pearson fundamental lemma, 248 
Noncentral chi-squared distribution, 82 
Noncentral F-distribution, 186 

tables of, 186 
Noncentral T²-distribution, 186 


O(N × p), 161 
Orthonormal vectors, 647 


Parallelotope, 266 
volume of, 266 

Partial correlation coefficient 
computational formulas for, 39, 40, 41 
confidence intervals for, 143 
distribution of sample, 143 
geometric interpretation of sample, 138 
invariance of population, 63 
invariance of sample, 166 
maximum likelihood estimator of, 138 
in the population, 35 
recursion formula for, 41 
sample, 138 
tests about, 144 

Partial covariance, 34 
estimator of, 137 

Partial variance, 34 

Partitioning of a matrix, 635 
addition of, 635 
of a covariance matrix, 25 
determinant of, 637 
inverse of, 638 
multiplication of, 635 
Partitioning of a vector, 635 
of a mean vector, 25 
of a random vector, 24 
Path analysis, 596. See also Graphical models 
Pearson correlation coefficient, see 
Correlation coefficient 
Plane of closest fit, 466 
Polar coordinates, 285 
Positive definite matrix, 628 
Positive part of a function, 96 
of the James-Stein estimator, 97 
Positive semidefinite matrix, 628 
Precision matrix, 272 
unbiased estimator of, 274 
Principal axes of ellipsoids of constant density, 
465. See also Principal components 
Principal components, 459 
asymptotic distribution of sample, 473 
computation of, 469 
confidence region for, 475, 477 
distribution of sample, 540, 542 
maximum likelihood estimator of, 467 
population, 464 
testing hypotheses about, 478, 479, 480 
Probability element, 8 
Product-moment correlation coefficient, see 
Correlation coefficient 


QL algorithm, 471 
QR algorithm, 471 
decomposition, 647 
Quadratic form, 628 
Quadratic loss function for covariance matrix, 
276 


r, 71 

ℛ (real part), 257 

Random matrix, 16 
expected value of, 17 

Random vector, 16 

Randomized test, definition of, 192 

Rectangular coordinates, 257 
distribution of, 255, 257 

Reduced rank regression, 514 
estimator, asymptotic distribution of, 550 

Regression coefficients and function, 34 
confidence regions for, 339 
distribution of sample, 297 
geometric interpretation of sample, 138 
maximum likelihood estimator of, 294 
partial correlation, connection with, 61 
sample, 294 
Regression coefficients (Continued) 
simultaneous confidence intervals for, 340, 
341 
testing hypotheses of rank of, 512 
testing they are zero, in case of one 
dependent variable, 152 
Residuals from regression, 37 
Risk function, 88 


Selection of linear combinations, 201 
Simple correlation coefficient, see Correlation 
coefficient 
Simultaneous equations, 513 
estimation of coefficients, 518 
least variance ratio (LVR), 519 
limited information maximum likelihood 
(LIML), 519 
two stage least squares (TSLS), 522 
identification by zeros, 516 
reduced form, 516 
estimation of, 517 
Singular normal distribution, 30, 31 
Singular value decomposition, 498, 634 
Spherical distribution, 105 
left, 105 
right, 105 
vector, 105 
Spherical normal distribution, 23 
Spherically contoured distribution, 47 
stochastic representation, 49 
uniform distribution, 48 
Sphericity test, see Testing that a covariance 
matrix is proportional to a given matrix 
Standardized sum statistics, 201 
Standardized variable, 22 
Stiefel manifold, 162 
Stochastic convergence, 113 
of a sequence of random matrices, 113 
Sufficiency, definition of, 83 
Sufficiency of sample mean vector and 
covariance matrix, 83 
Surface area of unit sphere, 286 
Surfaces of constant density, 22 
Symmetric matrix, 626 


T²-statistic, 176. See also T²-test and statistic 
T²-test and statistic, 173 
admissibility of, 196 
as Bayes procedure, 199 
distribution of statistic, 176 
geometric interpretation of statistic, 174 
invariance of, 173 
as likelihood ratio test of mean vector, 176
limiting distribution of, 176 
noncentral distribution of statistic, 186 
optimal properties of, 190
power of, 186 
tables of, 186 
for testing equality of means when 
covariance matrices are different, 187 
for testing equality of two mean vectors 
when covariance matrix is unknown, 179 
for testing symmetry in mean vector, 182 
as uniformly most powerful invariant test of 
mean vector, 191 
Testing that a covariance matrix is a given 
matrix, 438 
invariant tests of, 442 
likelihood ratio criterion for, 438 
modified likelihood ratio criterion for, 438 
asymptotic expansion of distribution of, 
442 
moments of, 440 
table of significance points, 685 
Nagao's criterion, 442
Testing that a covariance matrix is 
proportional to a given matrix, 431 
invariant tests, 436 
likelihood ratio criterion for, 434 
admissibility, 458 
asymptotic expansion of distribution of, 
435 
moments of, 434 
table of significance points, 682 
Nagao's criterion, 437 
Testing that a covariance matrix and mean 
vector are equal to a given matrix and 
vector, 444 
likelihood ratio criterion for, 444 
asymptotic expansion of distribution of, 
446 
moments of, 445
Testing equality of covariance matrices, 412 
invariant tests, 428
likelihood ratio criterion for, 413 
invariance of, 414 
modified likelihood ratio criterion for, 413 
admissibility of, 449 
asymptotic expansion of distribution of, 
425 
distribution of, 420 
moments of, 422 
table of significance points, 681 
Nagao's criterion for, 415 
Testing equality of covariance matrices and 
mean vectors, 415 
likelihood ratio criterion for, 415 
asymptotic expansion of distribution of, 
426 
distribution of, 421 
moments of, 422
unbiasedness of, 416 
Testing independence of sets of variates, 
381 
and canonical correlations, 504 
likelihood ratio criterion for, 384 
admissibility of, 401 
asymptotic expansion of distribution of, 
390 
distribution of, 388 
invariance of, 386 
moments of, 388 
monotonicity of power function of, 404 
unbiasedness of, 386 
Nagao's test, 392 
asymptotic expansion of distribution of, 
392
stepdown tests, 393 
Testing rank of regression matrices, 512 
Tests of hypotheses, see Correlation 
coefficient; Generalized analysis of 
variance; Linear hypothesis; Mean vector; 
Multiple correlation coefficient; Partial 
correlation coefficient; Regression 
coefficients; T²-test and statistic
Tetrachoric functions, 23 
Total correlation coefficient, see Correlation
coefficient 

Trace of a matrix, 629

Transformation of variables, 12


Unbiased estimator, definition of, 77

Unbiased test, definition of, 364

Uniform distribution on unit sphere, 48 
on O(N × p), 162


Variance, 17 
generalized, see Generalized variance 
maximum likelihood estimator of, 71


w(A | Σ, n), 255


W(Σ, n), 255
w⁻¹(B | Ψ, m), 272
W⁻¹(Ψ, m), 272


Wishart distribution, 256 
characteristic function of, 258
geometric interpretation of, 256 
marginal distributions of, 260 
noncentral, 587 
for p = 2, 124 


z, see Fisher’s z 
Zonal polynomials, 473 